
Tobias Isenberg, Petra Isenberg, Jian Chen, Michael Sedlmair, Torsten Möller. A Systematic Review on the Practice of Evaluating Visualization. IEEE Transactions on Visualization and Computer Graphics, 19(12), pp. 2818–2827, 2013. DOI: 10.1109/TVCG.2013.126. HAL: hal-00846775.

A Systematic Review on the Practice of Evaluating Visualization
Tobias Isenberg, Senior Member, IEEE, Petra Isenberg, Jian Chen, Member, IEEE,
Michael Sedlmair, Member, IEEE, and Torsten Möller, Senior Member, IEEE
Abstract—We present an assessment of the state and historic development of evaluation practices as reported in papers published
at the IEEE Visualization conference. Our goal is to reflect on a meta-level about evaluation in our community through a systematic
understanding of the characteristics and goals of presented evaluations. For this purpose we conducted a systematic review of
ten years of evaluations in the published papers using and extending a coding scheme previously established by Lam et al. [2012].
The results of our review include an overview of the most common evaluation goals in the community, how they evolved over time,
and how they contrast or align to those of the IEEE Information Visualization conference. In particular, we found that evaluations
specific to assessing resulting images and algorithm performance are the most prevalent (with consistently 80–90% of all papers since
1997). However, especially over the last six years there is a steady increase in evaluation methods that include participants, either by
evaluating their performances and subjective feedback or by evaluating their work practices and their improved analysis and reasoning
capabilities using visual tools. Up to 2010, this trend in the IEEE Visualization conference was much more pronounced than in the
IEEE Information Visualization conference which only showed an increasing percentage of evaluation through user performance and
experience testing. Since 2011, however, papers in IEEE Information Visualization also show such an increase in evaluations of work
practices and analysis as well as reasoning using visual tools. Further, we found that studies reporting requirements
analyses and domain-specific work practices are generally reported too informally, which hinders cross-comparison and lowers external validity.
Index Terms—Evaluation, validation, systematic review, visualization, scientific visualization, information visualization
1 MOTIVATION
In this paper, we report a systematic review of 581 papers from ten years
of IEEE Visualization conference publications with respect to their use
of evaluation. We provide a quantitative and objective report of the
types of evaluations encountered in the literature. At the same time, we
also qualitatively assess our observations from coding these 581 papers.
Specifically, we put evaluation practices into historic perspective and
assess and compare them in context to those of the larger visualization
community. Our goal in pursuing this work is to get an understanding
of the practices of evaluation in visualization research as a whole.
The importance of evaluation to the field of visualization has become
well recognized—demonstrated by the growing body of work on how to
conduct visualization evaluation and by the growing amount of research
papers that incorporate some form of formal or informal evaluation. In
this article we contribute to the body of work by providing a systematic
assessment and understanding of the evaluation practices reflected by
published peer-reviewed visualization papers that have not been subject
to such a systematic assessment in the past.
Our work is based on Lam et al.’s [38] recent literature analysis,
in which they identified seven evaluation scenarios in visualization
research articles. Their paper is an important contribution but does not
reflect on the entire visualization community. It focuses on what is
known as the ‘information visualization’ sub-community and excludes
all other visualization flavors. While Lam et al. primarily focused on
identifying evaluation scenarios, our goal with this paper is different.
We aim to complete the assessment for the larger visualization commu-
nity by answering the question: What are evaluation practices in the
‘scientific visualization’ part of our community? What are similarities
and differences between these sub-communities? To do so, we use and
extend Lam et al.’s scenarios to systematically analyze the literature
that appeared at the IEEE Visualization conference. We believe that
our extended work is fundamental to understanding all subcultures in
visualization and to properly sample all aspects of visualization work,
not only those labeled as ‘information visualization’.

• Tobias Isenberg is with INRIA, France. E-mail: tobias.isenberg@inria.fr.
• Petra Isenberg is with INRIA, France. E-mail: petra.isenberg@inria.fr.
• Jian Chen is with the University of Maryland, Baltimore County, USA. E-mail: jichen@umbc.edu.
• Michael Sedlmair is with the University of Vienna, Austria. E-mail: michael.sedlmair@univie.ac.at.
• Torsten Möller is with the University of Vienna, Austria. E-mail: torsten.moeller@univie.ac.at.

Manuscript received 31 March 2013; accepted 1 August 2013; posted online 13 October 2013; mailed on 4 October 2013. For information on obtaining reprints of this article, please send e-mail to: tvcg@computer.org.
By looking at the historic record, we were hoping to uncover some
trends by examining how the field of visualization has been changing
over the last 15 years. We wondered whether the self-reflection by
some of the field’s leaders in the early 2000s has left
its mark on our community and whether it led to more rigor in our
evaluations. Likewise, our work is an opportunity to compare the IEEE
Information Visualization and IEEE Visualization conferences to better
understand their differences and commonalities. Our analysis of evalu-
ation methods in visualization exposed a number of both weaknesses
and strengths from which we, as a community, can learn for future
work. Hence, we not only describe the current evaluation practices but
also show what evaluation types are possible and how to improve their
reporting in visualization papers. We thus highlight exemplary papers
and discuss a number of pitfalls that should be avoided.
In summary, the contributions of our paper are threefold. First, we
objectively report the current evaluation practices in the visualization
community. This is a quantitative report, focusing on the works in
the IEEE Visualization conference, complementing the work done by
Lam et al. [38]. Second, we give a historical overview of the use of
evaluation in the visualization community as reported in the IEEE
Information Visualization and IEEE Visualization conferences and put
evaluation practices into perspective. This is a qualitative assessment
and provides a historical perspective by comparing current and past
evaluation practices. And, third, we provide information for researchers
conducting evaluation by assisting them to identify, justify, and refine
evaluation approaches as well as helping them to recognize and avoid
pitfalls that can be learned from previous research.
2 FUNDAMENTALS AND RELATED WORK
There are two traditions of evaluation that the visualization community
draws from—evaluation in the sciences (both social and natural) and
evaluation in design. On the one hand, scientists try to understand the
world and seek a representative model, often a mathematical model
(e. g., Newton’s law or Fitts’ law), while designers and engineers intro-
duce a tool and henceforth seek to alter the world in which they live and

with which they interact. Science is concerned with model validation
and reproducibility. In addition, in the computational sciences, the
mathematical model is turned into a computer algorithm. This invokes
challenges of verifying the algorithm based on the mathematical model.
For designers, the focus is what is called a ‘user’, putting the empha-
sis on the ‘human-in-the-loop’. Hence, aspects of tool functionality,
usability, and aesthetics are of concern.
2.1 Validation, Verification, and Reproducibility
In computational science, validation refers to the process of ensuring
the correctness of a conceptual or mathematical model with respect to
the salient aspects of reality [4]. In contrast, verification refers to the process
of determining the accuracy of an algorithmic implementation with
respect to the mathematical model. It is important to point out the
dilemma in science that theories and models cannot be validated, only
invalidated. Hence, the process of validation and verification tends to
be a difficult one and of empirical nature. It is thus common to test
one’s algorithms and models on a number of well-chosen test cases. In
the words of Karl Popper, one of the prominent philosophers of science:
“So long as a theory withstands severe tests and is not superseded by
another theory in the course of scientific progress, we may say that it
has ‘proved its mettle’” [54].
Simply computing an (absolute or relative) error measure between
a known, highly accurate solution and a current algorithmic output
tends to be a standard in code verification and is common practice
in visualization research. However, Etiene et al. [17] recently have
pointed out that asymptotic error measures can be more powerful in
finding problems in an implementation.
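To make the distinction concrete, the following minimal sketch (not taken from any specific paper) illustrates the kind of check such verification entails: a relative error against a trusted reference plus an observed order of convergence across refined grids. The error values and the isosurface scenario in the comments are hypothetical.

```python
import numpy as np

def relative_error(approx, reference):
    """L2 relative error between an algorithmic output and a trusted reference."""
    return np.linalg.norm(approx - reference) / np.linalg.norm(reference)

def observed_order(errors, spacings):
    """Estimate the observed order of convergence p from errors measured on
    successively refined grids: e ~ C * h^p  =>  p = log(e1/e2) / log(h1/h2)."""
    orders = []
    for (e1, h1), (e2, h2) in zip(zip(errors, spacings), zip(errors[1:], spacings[1:])):
        orders.append(np.log(e1 / e2) / np.log(h1 / h2))
    return orders

# Hypothetical measurements: errors of an isosurface extractor against an
# analytic scalar field, sampled on grids with spacing h, h/2, h/4.
spacings = [1.0, 0.5, 0.25]
errors = [8.0e-2, 2.1e-2, 5.4e-3]          # made-up numbers for illustration
print(observed_order(errors, spacings))    # values near 2 suggest 2nd-order accuracy
```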
Reproducibility of experiments is essential during the validation
process. Experiments are often the basis for our conceptual models
of the real world, but come with imprecision attached. Being able to
quantify this error and reproduce the measurements greatly increases
the confidence in the model and theories—which is important in the
social and natural sciences alike. Based on this notion a movement in
the computational sciences has been established known as reproducible
research [15]. It advocates the publication of data and source code
together with the paper in order to improve independent validation of
the proposed models, thereby increasing the trust in them and
accelerating scientific progress. To address these issues in our community
(see also, e. g., [16, 21, 31, 67]), recently the EuroRVVV workshop¹
has been created to discuss, in particular, problems of reproducibility,
verification, and validation in visualization research.
2.2 Human-In-The-Loop
The Human-Computer Interaction (HCI) community has focused on
understanding the human-centered-design process specific to computa-
tional tools. Hence, while the functionality (effectiveness) of a tool is
of primary concern, usability (efficiency) and aesthetics (affect) play an
important role as well [61, 63]. Just like for the scientific method, there
is no possibility to fool-proof a tool. Hence, in both cases—science
and engineering—evaluation is based on empirical methods.
While the general practices in HCI are also applicable to our field,
visualization research has several unique properties that have led re-
searchers to reflect on how to best study and evaluate visualization
tools. Therefore, with BELIV² a dedicated workshop series has been
established on this topic, similar to EuroRVVV.
Carpendale [9] provides an excellent overview of different empirical
evaluation approaches and strategies as they can be applied to visualiza-
tion research. In particular, she describes quantitative and qualitative
evaluation methods, highlighting the advantages and challenges for
both. Important differences between a quantitative, controlled approach
and a more qualitative study technique that aims at measuring insight
using open-ended protocols have also been argued well by North [50].
While not specifically focused on visualization, a further important
paper on finding the right study approach is Greenberg and Buxton’s
“usability evaluation considered harmful (some of the time)” [23], in
which they discuss pitfalls of over-focusing on usability studies.

¹ EuroVis Workshop on Reproducibility, Verification, and Validation in Visualization; see http://www.eurorvvv.org/.
² Beyond Time and Errors: Novel Evaluation Methods for (Information) Visualization; see http://www.beliv.org/.
Munzner’s Nested Model [47] identifies four levels of visualization
design—problem characterization, data and task abstraction, visual
encoding/interaction design, and algorithm design—and provides guid-
ance on valid evaluation methods for these different design levels. Tory
and Möller [65] specifically discuss the role of human factors in visual-
ization and advocate various methods applied in user-centered design
processes. Based on such methods, Sedlmair et al. [59] provide a de-
sign study methodology that guides the selection of such evaluation
methods in problem-driven and collaborative visualization projects.
2.3 Novel Evaluation Methods
While visualization research borrows heavily from other disciplines,
several researchers have either developed new methods or discussed
in detail how certain approaches need to be extended for visualization
evaluation. In particular, empirical evaluation and the consideration of
human factors are actively being discussed [3, 9, 10, 11, 19, 36, 46, 53,
65, 68]. Here we highlight some select methods discussed in the past.
North’s insight-based method mentioned above [50] is one promi-
nent example of a novel method for visualization evaluation. Others
have reflected on the use of a critical inspection as a form of evaluating
visualizations. This inspection can either be done by the authors of
an article themselves or by reporting feedback from external experts.
This “critical thinking about visualization” [34] has to be neutral and be
backed up by facts, as Kosara points out. In addition, Tory and Möller [66]
argue that domain expert feedback can be a viable complement
to controlled studies, both for heuristic evaluation of usability as well
as for understanding the support of high-level cognitive tasks. How-
ever, not only the judgment of domain experts but also that of visual
experts such as artists, graphic designers, or illustrators can be useful
as has been shown in a few cases [1, 27, 29, 30]—in particular since
‘critique’ as a technique originated from teaching in the visual arts [35].
In visualization, it can be used in combination with techniques such as
sketching and ideation when developing new visualization techniques
[26]. For evaluating the impact of visualization tools on practices of
real users, specific forms of case studies have been suggested [62].
Even gameplay as a form of human computing can be used to involve a
wide audience in evaluating visualizations [2]. Van Wijk [69, 70] em-
ployed an economic model to assess the “value of visualization” based
on effectiveness and efficiency. He uses this model to explain the success,
or lack thereof, of a number of example visualization techniques and tools.
Van Wijk’s model, however, is based on the correct estimation of costs
and benefits for a tool, which is often difficult to obtain in practice.
2.4 The Practice of Evaluation in Visualization Research
While the cited researchers have reflected on methodological ap-
proaches for evaluation, others have looked systematically at evaluation
in visualization (e. g., [11, 52, 65]). Most influential for our work is the
meta-study on evaluation goals by Lam et al. [38]. They examined 850
papers published at the IEEE Information Visualization symposium/
conference (InfoVis), in Palgrave’s Journal of Information Visualization
(IV), at the IEEE Symposium on Visual Analytics in Science and Tech-
nology (VAST), as well as at the Eurographics Symposium/Conference
on Visualization (EuroVis)³ in the years 1995–2010. Based on their
analysis, Lam et al. identified seven scenarios that delineate empirical
evaluation goals and visualization questions prevalent in the examined
visualization publications. They found that, in the chosen sample of
papers, the use of evaluation in visualization papers is steadily increas-
ing but that the types of evaluation most frequently used are those
that examine people’s performances, user experience, and objective
measures of algorithm quality and performance. However, because
Lam et al.’s systematic review of the use of evaluation in visualization
is restricted to what they consider to be ‘information’ visualization
work, it only provides a part of the whole picture. In our own work we
use Lam et al.’s set of seven scenarios to analyze the use of evaluation
in the research published at the IEEE Visualization conference and
compare our results to those by Lam et al. Their rigorous methodological
coding approach and the resulting descriptive scenarios give us a
straightforward way to build and extend upon their work. Furthermore,
it allows us to compare different practices and trends in the visualization
sub-communities of ‘scientific’ and ‘information’ visualization, and to
draw conclusions based upon these findings.

³ Of the papers published at the EuroVis conferences, (following their reviewers’ request) Lam et al. [38] excluded those papers they (Lam et al.) classified as “pure SciVis papers [. . . ] based on visualization type: (e. g., pure volume, molecular, fibre-bundle, or flow visualization)”.
3 APPROACH AND METHODOLOGY
In order to get a systematic overview of the state of evaluation in visual-
ization we conducted a rigorous qualitative literature review. Qualitative
literature reviews are a standard technique in many areas of science
to objectively report on current knowledge or practices on a topic of
interest [22]. As a comprehensive overview, a literature review can
help to place a topic or practices into perspective. We approached our
literature review as discussed in the following sections.
3.1 Choice of Literature
To get a comprehensive overview of the use of evaluation in visual-
ization as a whole we assessed the respective practices in the IEEE
Visualization conference (now IEEE Scientific Visualization) and com-
pared them later to the previous assessment by Lam et al. [38] of the IEEE
Information Visualization conference. This approach allowed us to
inspect a good cross-section of topics, approaches, and solutions com-
mon to the visualization community. Out of the past 23 years of the
IEEE Visualization conference, we chose to code the past seven years
(2012–2006) as well as 2003, 2000, and 1997. Coding the past seven
years allowed us to reflect on current practices, while coding the earlier
years allowed us to put results into historical perspective.
3.2 Choice of Codes
We based our coding scheme on the seven scenarios presented by Lam
et al. [38]. Each scenario was assigned as a code. In the process of
coding, we extended the initial list by one code (QRI). We also decided
to rename Lam et al.’s VA (Visualization Algorithms) code to AP
(Algorithm Performance) to more accurately reflect our findings. Based
on these changes, we used the following list of codes (a brief illustrative
sketch of the code set follows the list):
UWP Understanding Environments and Work Practices: This code
includes evaluations that derive an understanding of the work, anal-
ysis, or information processing practices by a given group of people
with or without software use. Common examples are evaluations
with experts to understand their data analysis needs and require-
ments for developing a visualization.
VDAR Visual Data Analysis and Reasoning: This code includes
evaluations that assess how a visualization tool supports analysis
and reasoning about data and helps to derive relevant knowledge
in a given domain. Example evaluations include those that study
experts using a tool on their data and analyzing how they can solve
domain-specific questions with a new tool.
CTV Evaluating Communication Through Visualization: This code
includes evaluations that assess the communicative value of a visu-
alization or visual representation in regards to goals such as teach-
ing/learning, idea presentation, or casual use. For example, a study
that assesses how well a visualization can communicate medical
information to a patient would fall into this category.
CDA Evaluating Collaborative Data Analysis: Evaluations in this
group try to understand to what extent a visualization tool supports
collaborative data analysis by groups of people.
While the previous scenarios focused on the process of data analysis,
the remainder focuses on understanding visualizations or visualization
systems and algorithms:
UP User Performance: Evaluations in this category objectively mea-
sure how specific features affect the performance of people with
a system. Controlled experiments using time and error are typical
example methods in this category.
UE User Experience: This code includes evaluations that elicit sub-
jective feedback and opinions on a visualization (tool). Interviews
and Likert-scale questionnaires are common methods to do so.
AP Algorithm Performance: Evaluations in this category quantita-
tively study the performance or quality of visualization algorithms.
The most common examples include measurements of rendering
speed or memory performance. This scenario was originally called
VA (Visualization Algorithms) in Lam et al.’s [38] paper.
QRI Qualitative Result Inspection: Evaluations in this category are
evaluations through qualitative discussions and assessments of visu-
alization results. In contrast to UE, they do not involve actual end
users or participants but instead ask the viewer of a resulting image
to make an assessment for themselves. The following section gives
details on why we chose to add this code.
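As a purely illustrative aside (not part of the original coding materials), the code set and the fact that one paper can carry several codes could be represented as follows; the example paper and its codes are hypothetical.

```python
from enum import Enum
from dataclasses import dataclass, field

class Scenario(Enum):
    """Evaluation scenario codes used in this review (extending Lam et al.)."""
    UWP = "Understanding Environments and Work Practices"
    VDAR = "Visual Data Analysis and Reasoning"
    CTV = "Evaluating Communication Through Visualization"
    CDA = "Evaluating Collaborative Data Analysis"
    UP = "User Performance"
    UE = "User Experience"
    AP = "Algorithm Performance"
    QRI = "Qualitative Result Inspection"

@dataclass
class CodedPaper:
    """One paper can carry several scenario codes (many papers report multiple evaluations)."""
    title: str
    year: int
    codes: set[Scenario] = field(default_factory=set)

# Hypothetical example: a volume-rendering paper evaluated by timing and result inspection.
paper = CodedPaper("A hypothetical volume renderer", 2009, {Scenario.AP, Scenario.QRI})
print(sorted(c.name for c in paper.codes))   # ['AP', 'QRI']
```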
3.3 Coding Method
Five coders (all co-authors of this paper) participated in the assessment
of the literature. To calibrate, we started coding with Lam et al.’s [38]
seven unique scenarios. We randomly picked the year 2009 to calibrate
seven unique scenarios. We randomly picked the year 2009 to calibrate
our codes. Each of the 54 papers in this year was assigned to a varying
set of three coders. After the first coding pass the inter-coder reliability
had reached 0.743 (Krippendorff’s alpha [37, Ch. 12]). For each paper
with a conflicted code, the coders discussed reasons for discrepancies
and resolved them. After the first coding pass, all coders also met to
discuss whether the initial code set needed to be extended. It was at this
point that we decided to introduce the QRI code. We noted that a large
number of papers in 2009 used QRI as an evaluation or proof for the
quality of their results. A discussion among the authors of this paper
ensued as to the validity of the code as an actual type of ‘evaluation’
and we will further reflect on the issue in Sect. 4. Yet, given that the
coding of 2009 revealed an apparent prevalence of papers with QRI and
because of past discussions of the approach in the literature [34, 47, 69],
we included this code to quantitatively assess its actual prevalence and
to be able to qualitatively discuss QRI practices.
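For readers unfamiliar with the measure, the following is a minimal sketch (not the tooling we actually used) of how Krippendorff's alpha can be computed for nominal codes with a variable number of coders per paper; the four example units are invented.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.
    `ratings` is a list of units; each unit lists the codes assigned by the
    coders who rated it (units with fewer than 2 ratings are ignored)."""
    coincidences = Counter()                    # o_ck: coincidence matrix
    for unit in ratings:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):    # ordered pairs of ratings within one unit
            coincidences[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()                             # marginal totals per code
    for (c, _), w in coincidences.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - observed / expected

# Invented example: 4 papers, each coded by up to 3 coders.
units = [["AP", "AP", "AP"], ["QRI", "QRI", "AP"], ["UE", "UE", None], ["AP", "QRI", "QRI"]]
print(round(krippendorff_alpha_nominal(units), 3))   # 0.474 for this toy data
```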
After the first conflicts had all been resolved, the remaining papers
were assigned to one coder each. In the process of coding the remaining
papers all coders remained in close contact and communicated when
choices were made as to which evaluations to include. For example,
one point of discussion pertained to the amount of rigor required for an
evaluation to be included. While some articles, for example, included
multi-dataset comparisons of rendering speeds, some papers just re-
ported them for a single example or in a rough manner (“less than 1
second”). We decided not to exclude evaluations based on rigor but
chose to exclude cases that just reported anecdotal evidence. Papers
that were unclear were marked and re-coded by a second coder.
4 RESULTS AND DISCUSSION
Collaboratively coding this broad set of papers led to many intensive
discussions among the authors regarding current evaluation practices
in our field, the meaning of rigorous and convincing evaluation with
respect to what we observed in the papers, as well as the range of differ-
ent evaluation approaches that are covered within the categories. In this
section, we summarize both quantitative and qualitative observations
from our literature analysis and our discussions about them.
In total we coded 581 papers from the IEEE Visualization conference
as discussed in Sect. 3.⁴ Out of these, 569 (97%) included at least one
type of evaluation from our code set and 441 (76%) at least one out
of the original seven scenarios [38]. In total, we coded 1002 scenarios
spread across the 569 papers, meaning that many papers included
several evaluations with differing goals. Fig. 1(a) shows a histogram of
the total number of papers coded per scenario while Fig. 1(b) gives a
historic overview of the spread of evaluation scenarios coded, in percent
of all papers in a given year. Next, we discuss more detailed findings.
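As an illustration of how such per-scenario counts and per-year percentages (cf. Fig. 1) can be tallied from the coded data, consider the following minimal sketch; the records shown are hypothetical stand-ins for rows of the shared spreadsheet.

```python
from collections import Counter, defaultdict

# Hypothetical coded records: (year, set of scenario codes assigned to one paper).
coded_papers = [
    (2009, {"AP", "QRI"}),
    (2009, {"UE", "UP", "QRI"}),
    (2012, {"VDAR"}),
    # ... one entry per coded paper
]

papers_per_year = Counter(year for year, _ in coded_papers)
scenario_counts = defaultdict(Counter)        # year -> scenario code -> number of papers
for year, codes in coded_papers:
    for code in codes:
        scenario_counts[year][code] += 1

for year in sorted(scenario_counts):
    total = papers_per_year[year]
    shares = {c: f"{100 * k / total:.0f}%" for c, k in sorted(scenario_counts[year].items())}
    print(year, shares)
```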
4.1 Evaluation Scenarios
The original seven scenarios were grouped by Lam et al. into two main
categories: understanding visualizations (UP, UE, AP) and understand-
ing data analysis processes (UWP, VDAR, CDA, CTV). Our new code
QRI falls into the visualization category, leading to 955 of the 1002
(95%) evaluation scenarios we found in the coded papers being of
this category. We found only 47 scenarios (4.7%) that studied processes
of data analysis. We found no instance of a study that assessed the
communicative value of a visualization (CTV).

⁴ The data can be found in a Google Spreadsheet at http://goo.gl/CGswy.

[Fig. 1. Evaluation scenarios for IEEE Visualization conference papers. (a) Total number of papers per scenario (QRI, AP, UE, UP, UWP, VDAR, CDA, CTV; years: 1997, 2000, 2003, 2006–2012). (b) Scenarios in percent of papers per year coded, grouped into visualization- and process-oriented scenarios.]
The most common visualization scenario was QRI (46% of all sce-
narios) followed by AP (35% of all scenarios). Both scenarios together
covered 81% of all the scenarios we coded. Evaluations in both QRI
and AP were always conducted (by definition) without actual study
participants. Interestingly, these two codes differ sharply from the other
two in this category that involve studying participants: UE (9.3% of all
scenarios) and UP (4.6% of all scenarios).
The following subsections discuss our most interesting findings and
observations in regards to specific scenarios in more detail.
4.1.1 AP: The Importance of FPS and Memory Footprint
The reporting of performance of a (novel) algorithm, technique, or tool
was, with 35% of all coded scenarios, the second most frequent type
of evaluation we observed. Typically, authors reported computation
times for processing or rendering speeds in frames per second, for a
number of example datasets and a given platform. In the earlier papers,
we also frequently observed the reporting of memory footprints, albeit
less often in more recent papers—probably since memory shortage
seems to be less of a concern these days. Such performance metrics are
instructive because they inform the reader about the dataset sizes an
implementation or technique is applicable for, on a given platform.
Other objective metrics to evaluate a technique or implementation are
those that quantitatively assess the quality of a visualization algorithm.
Their goal is, therefore, to measure what a user can see or observe
without the need to study participants. In graph drawing, for instance,
quality measures such as the number of edge crossings are used as
criteria to assess the readability. For the visualization literature we
analyzed, this subset of AP evaluations included, for example, the
reporting of compression rates for shape compression techniques or
error metrics for the generated visuals. Such objective metrics to assess
a technique or its produced results were typically provided when some
kind of ground truth or other established quality standards (visual and
otherwise) existed against which results could be compared.
While frequently used as an evaluation approach, we saw a wide
range of reporting rigor in AP scenarios. In fact, it was when coding
this evaluation category that a discussion ensued among the authors at
which point the report of AP results should be called an ‘evaluation’. In
particular, for time and memory performance we saw papers that simply
reported a single frames-per-second number for a single, specifically
selected dataset. Some papers just reported a range of rendering speeds
without further specification of the dataset used or on which platform
they were produced. A popular but notoriously imprecise assessment
for rendering speeds was the term “interactive framerates” which can
mean anything from 1 fps to 120 fps or more. We decided not to code
such ‘performance evaluations’ as AP if they were limited to a single
measurement without a clear platform or dataset. In contrast, typical
evaluations gave rendering speeds for a number of different example
datasets and reported the platform used. Good evaluations analyzed
the behavior for a range of dataset sizes or used a number of additional
metrics to assess different concepts of visual quality of the results. A
good performance analysis was presented, for instance, by Lindstrom
and Isenburg [39]; a nice example of objectively analyzing the quality
of a proposed visualization result is Schultz and Seidel’s work [57].
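The following minimal sketch (not from any of the coded papers) illustrates the kind of reporting that distinguishes the better AP evaluations: timing over several dataset sizes, aggregated over many frames, together with the platform; render_frame is a placeholder for the algorithm under test, and the dataset sizes are arbitrary.

```python
import platform
import statistics
import time

def render_frame(num_cells):
    """Stand-in for the visualization algorithm under test."""
    return sum(i * i for i in range(num_cells))   # placeholder workload

def benchmark(dataset_sizes, frames=20):
    """Report fps and frame times across dataset sizes on a documented platform."""
    print("Platform:", platform.processor() or platform.machine(), "| Python", platform.python_version())
    for n in dataset_sizes:
        times = []
        for _ in range(frames):
            t0 = time.perf_counter()
            render_frame(n)
            times.append(time.perf_counter() - t0)
        fps = 1.0 / statistics.mean(times)
        print(f"{n:>10} cells: {fps:8.1f} fps  (median frame {1e3 * statistics.median(times):.2f} ms)")

# Report behavior across a range of dataset sizes rather than a single cherry-picked number.
benchmark([10_000, 100_000, 1_000_000])
```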
4.1.2 QRI: Qualitative Result Inspection
As discussed previously in Sect. 3, we added one category to the
seven categories defined by Lam et al.: the prevalent ‘qualitative results
inspection’ category (QRI). We included this code despite the fact that it
is not an evaluation in the traditional sense. Put simply, a QRI addresses
the reader of a paper and encourages him/her to agree on a quality
statement by inspecting a result image. An example statement could be
that “the figure shows that our tool can clearly depict structure x in the
data which was impossible with previous approaches.” While this is
just one example, we encountered a variety of different approaches and
factors of what we considered as QRI. Although we decided not to further
split this category during the coding, we next discuss these details for a
better understanding of the breadth and rigor we saw in this scenario.
In essence, we found three important types:
1. Image Quality: The classical form of QRI we found was the qual-
itative discussion of images produced by a (rendering) algorithm.
A new algorithm was often targeted at producing images of a
certain quality and it was common to show and assess visually
that quality goals had been met (e. g., [48]).
2. Visual Encoding: Introducing a new visual encoding (e. g., a
novel transfer function or novel glyphs for vector and tensor
fields) was also quite common. An example was the introduction
of superquadric glyphs for second-order tensors by Schultz and
Kindlmann [56]. A QRI would in this case highlight what these
new encodings could show and how.
3. Walkthrough: We intentionally, however, did not limit the scope
of this category to visual encoding or image quality discussions,
as we also found instances of qualitative discussions of system
behavior (e. g., [72]) and interaction concepts (e. g., [8]). These
discussions convincingly validated the proposed contributions.
We found two major approaches in how QRI were conducted: com-
parative and isolated. Comparative result inspections had clear state-
of-the-art competitors that provided different solutions for the same
problem. The goal was to improve upon these current state-of-the-art
solutions. A typical approach was to compute output images with
different algorithms, including the newly proposed one, for a range of
different datasets. These images were then compared side-by-side and
the authors walked the reader through them to explain the differences
and benefits of the newly proposed algorithm (e. g., [48]).
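As an illustration only (not from any of the coded papers), a comparative grid of this kind could be assembled as follows; load_result, the algorithm labels, and the dataset names are placeholders for the images a comparative QRI would discuss side by side.

```python
import numpy as np
import matplotlib.pyplot as plt

def load_result(algorithm, dataset):
    """Placeholder: return the image produced by `algorithm` on `dataset`."""
    rng = np.random.default_rng(hash((algorithm, dataset)) % 2**32)
    return rng.random((64, 64))

algorithms = ["previous method", "our method"]      # hypothetical labels
datasets = ["aneurysm", "engine", "tornado"]        # hypothetical datasets

fig, axes = plt.subplots(len(algorithms), len(datasets), figsize=(9, 6))
for row, algo in enumerate(algorithms):
    for col, data in enumerate(datasets):
        ax = axes[row, col]
        ax.imshow(load_result(algo, data), cmap="gray")
        ax.set_xticks([])
        ax.set_yticks([])
        if row == 0:
            ax.set_title(data)          # dataset name above each column
        if col == 0:
            ax.set_ylabel(algo)         # algorithm name next to each row
fig.tight_layout()
fig.savefig("comparison_grid.png", dpi=150)
```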
Another approach was to qualitatively inspect the results in isolation;
i. e., there was no clear competitor that addressed the same problem
which could be used for comparison. In those cases, a solid description
of the problem at hand as well as the justification of how the proposed
new algorithm/technique/system addressed it was mandatory. Not
doing so resulted in a pure description much like a manual that failed
the purpose of evaluation—we did not code these descriptions.

References (selected)

Krippendorff, K. Content Analysis: An Introduction to Its Methodology.
Merriam, S. B. Qualitative Research: A Guide to Design and Implementation.
Popper, K. The Logic of Scientific Discovery.
Shneiderman, B., et al. Designing the User Interface: Strategies for Effective Human-Computer Interaction.