Visualization of Regression Models Using visreg

01 Jan 2017-R Journal (The R Foundation)-Vol. 9, Iss: 2, pp 56-71
CONTRIBUTED RESEARCH ARTICLE 56
Visualization of Regression Models
Using visreg
by Patrick Breheny and Woodrow Burchett
Abstract
Regression models allow one to isolate the relationship between the outcome and an explanatory variable while the other variables are held constant. Here, we introduce an R package, visreg, for the convenient visualization of this relationship via short, simple function calls. In addition to estimates of this relationship, the package also provides pointwise confidence bands and partial residuals to allow assessment of variability as well as outliers and other deviations from modeling assumptions. The package provides several options for visualizing models with interactions, including lattice plots, contour plots, and both static and interactive perspective plots. The implementation of the package is designed to be fully object-oriented and interface seamlessly with R’s rich collection of model classes, allowing a consistent interface for visualizing not only linear models, but generalized linear models, proportional hazards models, generalized additive models, robust regression models, and many more.
Introduction
In simple linear regression, it is both straightforward and extremely useful to plot the regression line.
The plot tells you everything you need to know about the model and what it predicts. It is common to
superimpose this line over a scatter plot of the two variables. A further refinement is the addition of
a confidence band. Thus, in one plot, the analyst can immediately assess the empirical relationship between x and y in addition to the relationship estimated by the model and the uncertainty in that estimate, and also assess how well the two agree and whether assumptions may be violated.
Multiple regression models address a more complicated question: what is the relationship between
an explanatory variable and the outcome as the other explanatory variables are held constant? This
relationship is just as important to visualize as the relationship in simple linear regression, but doing
so is not nearly as common in statistical practice.
As models get more complicated, it becomes more difficult to construct these sorts of plots. With
multiple variables, we cannot simply plot the observed data, as this does not hold the other variables
constant. Interactions among variables, transformations, and non-linear relationships all add extra
barriers, making it time-consuming for the analyst to construct these plots. This is unfortunate, because as models grow more complex, there is an even greater need to represent them with clear illustrations.
In this paper, we aim to eliminate the hurdle of implementation through the development of a simple interface for visualizing regression models arising from a wide class of models: linear models, generalized linear models, robust regression models, additive models, proportional hazards models, and more. We implement this interface in R and provide it as the package visreg, publicly available from the Comprehensive R Archive Network. The purpose of the package is to automate the work involved in plotting regression functions, so that after fitting one of the above types of models, the analyst can construct attractive and illustrative plots with simple, one-line function calls. In particular, visreg offers several tools for the visualization of models containing interactions, which are among the easiest to misinterpret and the hardest to explain.
It is worth noting that there are two distinct goals involved in plotting regression models: illustrat-
ing the fitted model visually and diagnosing violations of model assumptions through examination of
residuals. The approach taken by visreg is to construct a single plot that simultaneously addresses both goals. This is not a new idea. Indeed, this project was inspired by the work of Trevor Hastie, Robert Tibshirani, and Simon Wood, who have convincingly demonstrated the utility of these types of plots in the context of generalized additive models (Hastie and Tibshirani, 1990; Wood, 2006).
In particular, visreg offers partial residuals, which can be defined for any regression model and are easily superimposed on visualization plots. Partial residuals are widely useful in detecting many types of problems, although several authors have pointed out that they are not without limitations (Mallows, 1986; Cook, 1993). Various extensions and modifications of partial residuals have been proposed, and there is an extensive literature on regression diagnostics (Belsley et al., 1980; Cook and Weisberg, 1982); indeed, many diagnostics are specific to the type of model (e.g., Pregibon, 1981; Grambsch and Therneau, 1994; Loy and Hofmann, 2013). Partial residuals are a useful, easily generalized idea that can be applied to virtually any type of model, although it is certainly worth being aware of other types of diagnostics that are specific to the modeling framework in question.
There are a number of R packages that offer functions for visualizing regression models, including
The R Journal Vol. 9/2, December 2017 ISSN 2073-4859

rms (Harrell, 2015), rockchalk (Johnson, 2016), car (Fox and Weisberg, 2011), effects (Fox, 2003), and, in base R, the termplot function. The primary advantage of visreg over these alternatives is that each of them is specific to visualizing a certain class of model, usually lm or glm. visreg, by virtue of its object-oriented approach, works with any model that provides a predict method, meaning that it can be used with hundreds of different R packages as well as user-defined model classes. We also feel that visreg offers a simpler interface and produces nicer-looking plots, but admit that beauty is in the eye of the beholder. Nevertheless, there are situations in which each of these packages is very useful and offers some features that others do not, such as greater flexibility for other types of residuals (car) and better support for visualizing three-way interactions (effects).
Each type of model has different mathematical details. All models, however, describe how the response is expected to vary as a function of the explanatory variables. In R, this is implemented for an extensive catalog of models that provide an associated predict method. Although there are no explicit rules forcing programmers to write predict methods for a given class in a consistent manner, there is a widely agreed-upon convention to follow the general syntax of predict.lm. It is this abstraction upon which visreg is based: the use of object-oriented programming to provide a single tool with a consistent interface for the convenient visualization of a wide array of models.
There are thousands of R packages, many of which provide an implementation of some type of model. It is impossible for any programmer or team of programmers to write an R package that is familiar with the details of all of them. However, the encapsulation and abstraction offered by an object-oriented programming language allow for an elegant solution to this problem. By passing a fitted model object to visreg, we can call the predict method provided by that model class to obtain appropriate predictions and standard errors without needing to know any of the details concerning how those calculations work for that type of model; the same applies to the construction of residuals through the residuals method.
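As a concrete sketch of this abstraction (our illustration, not code from the paper), the generic predict() call below dispatches to predict.lm and returns both fits and standard errors; any model class that follows this convention can be handled through the same interface:

```r
# Sketch of the predict() abstraction: generic dispatch returns predictions
# and standard errors without any class-specific code on the caller's side.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
nd <- data.frame(Solar.R = 200, Wind = 10, Temp = 80)
p <- predict(fit, newdata = nd, se.fit = TRUE)  # dispatches to predict.lm
c(fit = unname(p$fit), se = unname(p$se.fit))
```

Swapping `lm` for, say, a robust or additive model fitter leaves the calling code unchanged, which is exactly the property visreg exploits.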
The only other R package that we are aware of that provides this kind of object-oriented flexibility is plotmo by Stephen Milborrow. The visreg and plotmo projects were each started independently around the year 2011 and have developed into mature, widely used packages for model visualization. The organization and syntax of the packages are quite different, but both are based on the idea of using the generic predict and residuals methods provided by a model class to offer a single interface capable of visualizing virtually any type of model. The primary difference between the two packages is that plotmo separates the visualization of models and the plotting of residuals, constructed using the plotmo() and plotres() functions, respectively, while as mentioned earlier, visreg combines the two into a single plot (plotmo offers an option to superimpose the unadjusted response onto a plot, but this is very different from plotting partial residuals). Furthermore, as one would expect, each package offers a few options that the other does not. For example, plotmo offers the ability to construct partial dependence plots (Hastie et al., 2009), while visreg offers options for contrast plots and what we call “cross-sectional” plots (Figs. 6, 7, and 8). Broadly speaking, plotmo is somewhat more oriented towards machine learning-type models, while visreg is more oriented towards regression models, though both packages can be used for either purpose. In particular, plotmo supports the X,y syntax used by packages like glmnet, which is more popular among machine learning packages, while visreg focuses exclusively on models that use a formula-based interface.
The outline of the paper is as follows. In “Conditional and contrast plots”, we explicitly define the relevant mathematical details for what appears in visreg’s plots. The remainder of the article is devoted to illustrating the interface and results produced by the software in three extensions of simple linear regression: multiple (additive) linear regression models, models that possess interactions, and finally, other sorts of models, such as generalized linear models, proportional hazards models, random effect models, random forests, etc.
Conditional and contrast plots
We begin by considering regression models, where all types of visreg plots are well-developed and clearly defined. At the end of this section, we describe how these ideas can be extended generically to any model capable of making predictions.
In a regression model, the relationship between the outcome and the explanatory variables is expressed in terms of a linear predictor η:

    η = Xβ = Σ_j x_j β_j,    (1)
where x_j is the jth column of the design matrix X. For the sake of clarity, we focus in this section on linear regression, in which the expected value of the outcome E(Y_i) equals η_i; extensions to other, nonlinear models are discussed in “Other models”. In the absence of interactions (see “Linear models with interactions”), the relationship between X_j and Y is neatly summarized by β_j, which expresses the amount by which the expected value of Y changes given a one-unit change in X_j.
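To make equation (1) concrete, the linear predictor can be computed explicitly from the design matrix. This short sketch (ours, using the airquality data introduced later in the article) checks that Xβ̂ reproduces the fitted values of a linear model:

```r
# Compute the linear predictor eta = X %*% beta-hat by hand and compare it
# with the fitted values returned by lm().
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
X <- model.matrix(fit)          # design matrix, including the intercept column
eta <- drop(X %*% coef(fit))    # eta = X beta-hat
all.equal(unname(eta), unname(fitted(fit)))  # the two agree
```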
Partial residuals are a natural multiple regression analog to plotting the observed x and y in simple linear regression. Partial residuals were developed by Ezekiel (1924), rediscovered by Larsen and McCleary (1972), and have been discussed in numerous papers and textbooks ever since (Wood, 1973; Atkinson, 1982; Kutner et al., 2004). Letting r denote the vector of residuals for a given model fit, the partial residuals belonging to variable j are defined as

    r_j = y − X_{−j} β̂_{−j}    (2)
        = r + x_j β̂_j,    (3)

where the −j subscript refers to the portion of X or β that remains after the jth column/element is removed.
The reason partial residuals are a natural extension to the multiple regression setting is that the slope of the simple linear regression of r_j on x_j is equal to the value β̂_j that we obtain from the multiple regression model (Larsen and McCleary, 1972).
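This slope property is easy to verify numerically. In the sketch below (our illustration, again using the airquality data from the following sections), the simple regression of the partial residuals for Wind on Wind recovers the multiple-regression coefficient exactly:

```r
# Verify: the slope of lm(partial residuals ~ x_j) equals beta-hat_j.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
d <- model.frame(fit)                                # rows actually used in the fit
r_j <- residuals(fit) + d$Wind * coef(fit)["Wind"]   # r + x_j * beta-hat_j, eq. (3)
slope <- coef(lm(r_j ~ d$Wind))[2]
all.equal(unname(slope), unname(coef(fit)["Wind"]))  # slopes agree
```

The agreement is exact (up to floating point) because the residuals r are orthogonal to every column of the design matrix, so regressing r + x_j β̂_j on x_j contributes slope 0 + β̂_j.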
Thus, it would seem straightforward to visualize the relationship between X_j and Y by plotting a line with slope β̂_j through the partial residuals. Clearly, however, we may add any constant to the line and to r_j and the above result would still hold. Nor is it obvious how the confidence bands should be calculated.
We consider asking two subtly different questions about the relationship between X_j and Y:

(1) What is the relationship between E(Y) and X_j given x_{−j} = x*_{−j}?

(2) How do changes in X_j relative to a reference value x*_j affect E(Y)?

The biggest difference between the two questions is that the first requires specification of some x*_{−j}, whereas the second does not. The reward for specifying x*_{−j} is that specific values for the predicted E(Y) may be plotted on the scale of the original variable Y; the latter type of plot can address only relative changes. Here, we refer to the first type of plot as a conditional plot, and the second type as a contrast plot. As we will see, the two questions produce regression lines with identical slopes, but with different intercepts and confidence bands. It is worth noting that these are not the only possible questions; other possibilities, such as “What is the marginal relationship between X_j and Y, integrating over X_{−j}?” exist, although we do not explore them here.
For a contrast plot, we consider the effect of changing X_j away from an arbitrary point x*_j; the choice of x*_j thereby determines the intercept, as the line by definition passes through (x*_j, 0). The equation of this line is y = (x − x*_j) β̂_j. For a continuous X_j, we set x*_j equal to x̄_j. The confidence interval at the point x_j = x is based on

    V(x) = V{η̂(x) − η̂(x*_j)} = (x − x*_j)² V(β̂_j).
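The contrast variance above reduces to a single element of the coefficient covariance matrix. This sketch (ours) checks that (x − x*_j)² V(β̂_j) agrees with the general quadratic form λᵀ V(β̂) λ for a contrast in Wind, with the reference value x*_j taken as the mean of Wind:

```r
# Contrast-plot variance: V{eta-hat(x) - eta-hat(x*_j)} = (x - x*_j)^2 * V(beta-hat_j).
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
xstar <- mean(model.frame(fit)$Wind)      # reference value x*_j = mean of Wind
x <- 15                                   # an arbitrary point on the Wind axis
lam <- c(0, 0, x - xstar, 0)              # contrast vector over (Int, Solar.R, Wind, Temp)
v1 <- drop(t(lam) %*% vcov(fit) %*% lam)  # general quadratic form
v2 <- (x - xstar)^2 * vcov(fit)["Wind", "Wind"]
all.equal(v1, unname(v2))                 # the two computations agree
```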
When X_j is categorical, we plot differences between each level of the factor and the reference category (see Figure 3 for an example); in this case, we are literally plotting contrasts in the classical ANOVA sense of the term (hence the name). Our usage of the term “contrast” for continuous variables is somewhat looser, but still logical in the sense that it estimates the contrast between a value of X_j and the reference value.
For a conditional plot, on the other hand, all explanatory variables are fully specified by x and x*_{−j}. Let λ(x)ᵀ denote the row of the design matrix that would be constructed from x_j = x and x*_{−j}. Then the equation of the line is y = λ(x)ᵀ β̂ and the confidence interval at x is based on

    V(x) = V{λ(x)ᵀ β̂} = λ(x)ᵀ V(β̂) λ(x).
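The conditional estimate and its variance can likewise be checked against predict(). In this sketch (ours), λ(x) is built by hand for the additive model used in the next section and matched to the fit and standard error returned by predict.lm:

```r
# Conditional-plot quantities: y = lambda(x)' beta-hat and V(x) = lambda' V(beta-hat) lambda,
# compared with the output of predict(..., se.fit = TRUE).
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
nd <- data.frame(Solar.R = 200, Wind = 10, Temp = 80)
lam <- c(1, nd$Solar.R, nd$Wind, nd$Temp)   # row of the design matrix at x
p <- predict(fit, newdata = nd, se.fit = TRUE)
all.equal(unname(sum(lam * coef(fit))), unname(p$fit))                  # lambda' beta-hat
all.equal(drop(sqrt(t(lam) %*% vcov(fit) %*% lam)), unname(p$se.fit))   # sqrt of V(x)
```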
In both conditional and contrast plots, the confidence interval at x is then formed around the estimate in the usual manner by adding and subtracting t_{n−p, 1−α/2} √V(x), where t_{n−p, 1−α/2} is the 1 − α/2 quantile of the t distribution with n − p degrees of freedom. Examples of contrast plots and conditional plots are given in Figures 2 and 3. Both plots depict the same relationship between wind and ozone level as estimated by the same model (details given in the following section). Note the difference, however, in the vertical scale and confidence bands. In particular, the confidence interval for the contrast plot has zero width at x*_j; all other things remaining the same, if we do not change X_j, we can say with certainty that E(Y) will not change either. There is still uncertainty, however, regarding the actual value of E(Y), which is illustrated in the fact that the confidence interval of the conditional plot has positive width everywhere.
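The band construction can be reproduced directly. The sketch below (ours) rebuilds the upper limit from the t quantile and standard error and matches the confidence interval that predict.lm reports:

```r
# Wald band: estimate +/- t_{n-p, 1-alpha/2} * sqrt(V(x)), with alpha = 0.05.
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)
nd <- data.frame(Solar.R = 200, Wind = 10, Temp = 80)
p <- predict(fit, newdata = nd, se.fit = TRUE, interval = "confidence")
upper <- p$fit[, "fit"] + qt(0.975, df = p$df) * p$se.fit  # hand-built upper limit
all.equal(unname(upper), unname(p$fit[, "upr"]))           # matches predict.lm's band
```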
This description of confidence intervals focuses on Wald-type confidence intervals of the form estimate ± multiple of the standard error, constructed on the scale of the linear predictor. This is the most common type of interval provided by modeling packages in R, and the only one for which a widely agreed-upon, object-oriented consensus has emerged in terms of what the predict method returns. For this reason, this is usually the only type of interval available for plotting by visreg. However, it should be noted that these intervals are common for their convenience, not due to superiority; it is typically the case that more accurate confidence intervals exist (see, for example, Efron, 1987; Withers and Nadarajah, 2012). In principle, one could plot other types of intervals, but visreg does not calculate intervals itself so much as plot the intervals that the modeling package returns. Thus, unless the modeling package provides methods for calculating other types of intervals, visreg is restricted to plotting Wald intervals.
Contrast plots can only be constructed for regression-based models, as they explicitly require an additive decomposition in terms of a design matrix and coefficients. Conditional plots, however, can be constructed for any model that produces predictions. Denote this prediction f(x), where x is a vector of predictors for the model. Writing this as a one-dimensional function of predictor j with the remaining predictors fixed at x*_{−j}, let us express this prediction as f(x | x*_{−j}). In a conditional plot, the partial residuals for predictor j are

    r_j = r + x_j β̂_j + (x*_{−j})ᵀ β̂_{−j} = r + f(x | x*_{−j}),

which offers a clear procedure for constructing the equivalent of partial residuals for general prediction models. Note that this construction requires the model class to implement a residuals method. If a model class lacks a residuals method, visreg will still produce a plot, but must omit the partial residuals; see “Non-regression models” for additional details. Likewise, visreg requires the predict method for the model class to return standard errors in order to plot confidence intervals; see “Hierarchical and random effect models” for an example in which standard errors are not returned.
It is worth mentioning that visreg is only concerned with confidence bands for the conditional mean E(Y|X), not “prediction intervals” that have a specified probability of containing a future outcome Y observed for a certain value of X. Unlike standard errors for the mean, very few model classes in R offer methods for calculating such intervals; indeed, such intervals are often not well-defined outside of classical linear models.
Additive linear models
We are now ready to describe the basic framework and usage of visreg. In this section, we will fit various models to a data set involving the relationship between air quality (in terms of ozone concentration) and various aspects of weather in the standard R data set airquality.
Basic framework
The basic interface to the package is the function visreg, which requires only one argument: the fitted model object. So, for example, the following code produces Figure 1:
fit <- lm(Ozone ~ Solar.R + Wind + Temp, data=airquality)
visreg(fit)
By default, visreg provides conditional plots for each of the explanatory variables in the model. For the conditioning, the other variables in x*_{−j} are set to their median for numeric variables and to the most common category for factors. All of these options can be modified by passing additional arguments to visreg. For example, contrast plots can be obtained with the type argument; the following code produces Figure 2.
visreg(fit, "Wind", type="contrast")
visreg(fit, "Wind", type="conditional")
The second argument specifies the explanatory variable to be visualized; note that the right plot in
Figure 2 is the same as the middle plot in Figure 1.
Figure 1: Basic output of visreg for an additive linear model: conditional plots for each explanatory variable.

Figure 2: The estimated relationship between wind and ozone concentration in the same model, as illustrated by two different types of plots. Left: Contrast plot. Right: Conditional plot.

In addition to continuous explanatory variables, visreg also allows the easy visualization of differences between the levels of categorical variables (factors). The following block of code creates a factor called Heat by discretizing Temp, and then visualizes its relationship with Ozone, producing the plot in Figure 3.
airquality$Heat <- cut(airquality$Temp, 3, labels=c("Cool", "Mild", "Hot"))
fit.heat <- lm(Ozone ~ Solar.R + Wind + Heat, data=airquality)
visreg(fit.heat, "Heat", type="contrast")
visreg(fit.heat, "Heat", type="conditional")
Figure 3: Visualization of a regression function involving a categorical explanatory variable. Left: Contrast plot. Right: Conditional plot.
Again, note that the confidence interval for the contrast plot has zero width for the reference category. There is no uncertainty about how the expected value of ozone will change if we remain at the same level of Heat; it is zero by definition. On the other hand, the confidence interval for Mild heat is wider for the contrast plot than it is for the conditional plot. There is less uncertainty about the expected value of ozone on a mild day than there is about the difference in expected values between mild and cool days.

Citations
More filters
Journal ArticleDOI
TL;DR: In this article, the authors examined global N and P limitation using the ratio of site-averaged leaf resorption efficiencies of the dominant species across 171 sites and evaluated their predictions using a global database of N- and P-limitation experiments based on nutrient additions at 106 and 53 sites, respectively.
Abstract: Nitrogen (N) and phosphorus (P) limitation constrains the magnitude of terrestrial carbon uptake in response to elevated carbon dioxide and climate change. However, global maps of nutrient limitation are still lacking. Here we examined global N and P limitation using the ratio of site-averaged leaf N and P resorption efficiencies of the dominant species across 171 sites. We evaluated our predictions using a global database of N- and P-limitation experiments based on nutrient additions at 106 and 53 sites, respectively. Globally, we found a shift from relative P to N limitation for both higher latitudes and precipitation seasonality and lower mean annual temperature, temperature seasonality, mean annual precipitation and soil clay fraction. Excluding cropland, urban and glacial areas, we estimate that 18% of the natural terrestrial land area is significantly limited by N, whereas 43% is relatively P limited. The remaining 39% of the natural terrestrial land area could be co-limited by N and P or weakly limited by either nutrient alone. This work provides both a new framework for testing nutrient limitation and a benchmark of N and P limitation for models to constrain predictions of the terrestrial carbon sink. Spatial patterns in the phosphorus and nitrogen limitation in natural terrestrial ecosystems are reported from analysis of a global database of the resorption efficiency of nutrients by leaves.

426 citations

Journal ArticleDOI
03 Mar 2017-Science
TL;DR: Analysis of plant distributions, archaeological sites, and environmental data indicates that modern tree communities in Amazonia are structured to an important extent by a long history of plant domestication by Amazonian peoples.
Abstract: The extent to which pre-Columbian societies altered Amazonian landscapes is hotly debated. We performed a basin-wide analysis of pre-Columbian impacts on Amazonian forests by overlaying known archaeological sites in Amazonia with the distributions and abundances of 85 woody species domesticated by pre-Columbian peoples. Domesticated species are five times more likely than nondomesticated species to be hyperdominant. Across the basin, the relative abundance and richness of domesticated species increase in forests on and around archaeological sites. In southwestern and eastern Amazonia, distance to archaeological sites strongly influences the relative abundance and richness of domesticated species. Our analyses indicate that modern tree communities in Amazonia are structured to an important extent by a long history of plant domestication by Amazonian peoples.

398 citations

Journal ArticleDOI
TL;DR: Temperature-dependent transmission based on a mechanistic model is an important predictor of human transmission occurrence and incidence in tropical and subtropical regions and in temperate areas even if vectors are present.
Abstract: Recent epidemics of Zika, dengue, and chikungunya have heightened the need to understand the seasonal and geographic range of transmission by Aedes aegypti and Ae. albopictus mosquitoes. We use mechanistic transmission models to derive predictions for how the probability and magnitude of transmission for Zika, chikungunya, and dengue change with mean temperature, and we show that these predictions are well matched by human case data. Across all three viruses, models and human case data both show that transmission occurs between 18-34°C with maximal transmission occurring in a range from 26-29°C. Controlling for population size and two socioeconomic factors, temperature-dependent transmission based on our mechanistic model is an important predictor of human transmission occurrence and incidence. Risk maps indicate that tropical and subtropical regions are suitable for extended seasonal or year-round transmission, but transmission in temperate areas is limited to at most three months per year even if vectors are present. Such brief transmission windows limit the likelihood of major epidemics following disease introduction in temperate zones.

392 citations

Journal ArticleDOI
03 Sep 2020-Cell
TL;DR: This study provides a framework for interrogating how complex biological processes, such as antitumoral immunity, occur through concerted actions of cells and spatial domains in effective versus ineffective tumor control.

350 citations


Cites background or methods from "Visualization of Regression Models ..."

  • ...The partial residual plot in Figure 6J was created using the visreg R package (Breheny and Burchett, 2013)....

    [...]

  • ...…et al., 2010 Visreg R package https://cran.r-project.org/web/ packages/visreg/index.html Breheny and Burchett, 2013 Deldir R package https://cran.r-project.org/web/ packages/deldir/index.html N/A ComplexHeatmap R package…...

    [...]

Journal ArticleDOI
TL;DR: A short-term intervention with an isocaloric low-carbohydrate diet with increased protein content in obese subjects with NAFLD and the resulting alterations in metabolism and the gut microbiota are characterized using a multi-omics approach to highlight the potential of exploring diet-microbiota interactions for treatingNAFLD.

305 citations

References
More filters
Journal Article
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

272,030 citations

Book
01 Jan 1989
TL;DR: Hosmer and Lemeshow as discussed by the authors provide an accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets.
Abstract: From the reviews of the First Edition. "An interesting, useful, and well-written book on logistic regression models... Hosmer and Lemeshow have used very little mathematics, have presented difficult concepts heuristically and through illustrative examples, and have included references."- Choice "Well written, clearly organized, and comprehensive... the authors carefully walk the reader through the estimation of interpretation of coefficients from a wide variety of logistic regression models . . . their careful explication of the quantitative re-expression of coefficients from these various models is excellent." - Contemporary Sociology "An extremely well-written book that will certainly prove an invaluable acquisition to the practicing statistician who finds other literature on analysis of discrete data hard to follow or heavily theoretical."-The Statistician In this revised and updated edition of their popular book, David Hosmer and Stanley Lemeshow continue to provide an amazingly accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets. Hosmer and Lemeshow extend the discussion from biostatistics and epidemiology to cutting-edge applications in data mining and machine learning, guiding readers step-by-step through the use of modeling techniques for dichotomous data in diverse fields. Ample new topics and expanded discussions of existing material are accompanied by a wealth of real-world examples-with extensive data sets available over the Internet.

35,847 citations

Journal ArticleDOI
TL;DR: Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.
Abstract: \"A new edition of the definitive guide to logistic regression modeling for health science and other applicationsThis thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-the-art techniques for building, interpreting, and assessing the performance of LR models. New and updated features include: A chapter on the analysis of correlated outcome data. A wealth of additional material for topics ranging from Bayesian methods to assessing model fit Rich data sets from real-world studies that demonstrate each method under discussion. Detailed examples and interpretation of the presented results as well as exercises throughout Applied Logistic Regression, Third Edition is a must-have guide for professionals and researchers who need to model nominal or ordinal scaled outcome variables in public health, medicine, and the social sciences as well as a wide range of other fields and disciplines\"--

30,190 citations


"Visualization of Regression Models ..." refers to methods in this paper

  • ...We begin with a logistic regression model applied to a study investigating risk factors associated with low birth weight (Hosmer and Lemeshow, 2000)....

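The usage quoted above can be sketched in R. The following is a minimal illustration, not code from the paper itself; it assumes the `birthwt` data set shipped with the MASS package (the same low-birth-weight study as Hosmer and Lemeshow, 2000) and that the visreg package is installed:

```r
# Logistic regression for low birth weight, then an isolated-effect plot
library(MASS)    # provides the birthwt data set
library(visreg)  # visualization of regression models

fit <- glm(low ~ age + lwt + factor(race) + smoke,
           data = birthwt, family = binomial)

# Plot the relationship between mother's weight (lwt) and the outcome
# on the probability scale, with the other variables held constant;
# visreg adds pointwise confidence bands by default
visreg(fit, "lwt", scale = "response")
```

Here `scale = "response"` asks visreg to draw the fitted relationship on the probability scale rather than the linear-predictor scale.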

Book
13 Aug 2009
TL;DR: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Abstract: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics. With ggplot2, it's easy to: produce handsome, publication-quality plots, with automatic legends created from the plot specification; superpose multiple layers (points, lines, maps, tiles, box plots, to name a few) from different data sources, with automatically adjusted common scales; add customisable smoothers that use the powerful modelling capabilities of R, such as loess, linear models, generalised additive models and robust regression; save any ggplot2 plot (or part thereof) for later modification or reuse; create custom themes that capture in-house or journal style requirements, and that can easily be applied to multiple plots; and approach your graph from a visual perspective, thinking about how each component of the data is represented on the final plot. This book will be useful to everyone who has struggled with displaying their data in an informative and attractive way. You will need some basic knowledge of R (i.e. you should be able to get your data into R), but ggplot2 is a mini-language specifically tailored for producing graphics, and you'll learn everything you need in the book. After reading this book you'll be able to produce graphics customized precisely for your problems, and you'll find it easy to get graphics out of your head and on to the screen or page.
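The layering and smoothing capabilities described in this abstract can be illustrated with a short sketch; this is a minimal example, not from the book, using only ggplot2 and R's built-in `mtcars` data:

```r
library(ggplot2)

# Points plus a model-based smoother, layered on the same data,
# with an automatic legend created from the colour mapping
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # linear-model smoother with band
  theme_bw()                               # apply an alternative theme
print(p)

# Save the most recent plot for later modification or reuse
ggsave("mpg-vs-wt.png", width = 5, height = 4)
```

Each `geom_*()` call adds a layer; because both layers share the same aesthetic mapping, their scales and legend are adjusted automatically.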

29,504 citations

Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in these areas in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap.
Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

19,261 citations