
Tobias Isenberg, Petra Isenberg, Jian Chen, Michael Sedlmair, Torsten Möller. A Systematic Review on the Practice of Evaluating Visualization. IEEE Transactions on Visualization and Computer Graphics, 19(12), pp. 2818–2827, 2013. DOI: 10.1109/TVCG.2013.126. HAL: hal-00846775.

A Systematic Review on the Practice of Evaluating Visualization
Tobias Isenberg, Senior Member, IEEE, Petra Isenberg, Jian Chen, Member, IEEE,
Michael Sedlmair, Member, IEEE, and Torsten Möller, Senior Member, IEEE
Abstract—We present an assessment of the state and historic development of evaluation practices as reported in papers published
at the IEEE Visualization conference. Our goal is to reflect on a meta-level about evaluation in our community through a systematic
understanding of the characteristics and goals of presented evaluations. For this purpose we conducted a systematic review of
ten years of evaluations in the published papers using and extending a coding scheme previously established by Lam et al. [2012].
The results of our review include an overview of the most common evaluation goals in the community, how they evolved over time,
and how they contrast or align to those of the IEEE Information Visualization conference. In particular, we found that evaluations
specific to assessing resulting images and algorithm performance are the most prevalent (with consistently 80–90% of all papers since
1997). However, especially over the last six years there is a steady increase in evaluation methods that include participants, either by
evaluating their performances and subjective feedback or by evaluating their work practices and their improved analysis and reasoning
capabilities using visual tools. Up to 2010, this trend in the IEEE Visualization conference was much more pronounced than in the
IEEE Information Visualization conference which only showed an increasing percentage of evaluation through user performance and
experience testing. Since 2011, however, papers in IEEE Information Visualization also show such an increase in evaluations of work
practices and analysis as well as reasoning using visual tools. Further, we found that studies reporting requirements
analyses and domain-specific work practices are generally reported too informally, which hinders cross-comparison and lowers external validity.
Index Terms—Evaluation, validation, systematic review, visualization, scientific visualization, information visualization
1 MOTIVATION
In this paper, we report a systematic review of 581 papers from ten years
of IEEE Visualization conference publications with respect to their use
of evaluation. We provide a quantitative and objective report of the
types of evaluations encountered in the literature. At the same time, we
also qualitatively assess our observations from coding these 581 papers.
Specifically, we put evaluation practices into historic perspective and
assess and compare them in context to those of the larger visualization
community. Our goal in pursuing this work is to get an understanding
of the practices of evaluation in visualization research as a whole.
The importance of evaluation to the field of visualization has become
well recognized—demonstrated by the growing body of work on how to
conduct visualization evaluation and by the growing amount of research
papers that incorporate some form of formal or informal evaluation. In
this article we contribute to the body of work by providing a systematic
assessment and understanding of the evaluation practices reflected by
published peer-reviewed visualization papers that have not been subject
to such a systematic assessment in the past.
Our work is based on Lam et al.’s [38] recent literature analysis,
in which they identified seven evaluation scenarios in visualization
research articles. Their paper is an important contribution but does not
reflect on the entire visualization community. It focuses on what is
known as the ‘information visualization’ sub-community and excludes
all other visualization flavors. While Lam et al. primarily focused on
identifying evaluation scenarios, our goal with this paper is different.
We aim to complete the assessment for the larger visualization commu-
nity by answering the question: What are evaluation practices in the
‘scientific visualization’ part of our community? What are similarities
and differences between these sub-communities? To do so, we use and
extend Lam et al.’s scenarios to systematically analyze the literature
that appeared at the IEEE Visualization conference. We believe that
our extended work is fundamental to understanding all subcultures in
visualization and to properly sample all aspects of visualization work,
not only those labeled as ‘information visualization’.

• Tobias Isenberg is with INRIA, France. E-mail: tobias.isenberg@inria.fr.
• Petra Isenberg is with INRIA, France. E-mail: petra.isenberg@inria.fr.
• Jian Chen is with the University of Maryland, Baltimore County, USA. E-mail: jichen@umbc.edu.
• Michael Sedlmair is with the University of Vienna, Austria. E-mail: michael.sedlmair@univie.ac.at.
• Torsten Möller is with the University of Vienna, Austria. E-mail: torsten.moeller@univie.ac.at.

Manuscript received 31 March 2013; accepted 1 August 2013; posted online 13 October 2013; mailed on 4 October 2013. For information on obtaining reprints of this article, please send e-mail to: tvcg@computer.org.
By looking at the historic record, we were hoping to uncover some
trends by examining how the field of visualization has been changing
over the last 15 years. We wondered whether the self-reflection by
some of the field’s leaders in the early 2000s has left
its mark on our community and whether it led to more rigor in our
evaluations. Likewise, our work is an opportunity to compare the IEEE
Information Visualization and IEEE Visualization conferences to better
understand their differences and commonalities. Our analysis of evalu-
ation methods in visualization exposed a number of both weaknesses
and strengths from which we, as a community, can learn for future
work. Hence, we not only describe the current evaluation practices but
also show what evaluation types are possible and how to improve their
reporting in visualization papers. We thus highlight exemplary papers
and discuss a number of pitfalls that should be avoided.
In summary, the contributions of our paper are threefold. First, we
objectively report the current evaluation practices in the visualization
community. This is a quantitative report, focusing on the works in
the IEEE Visualization conference, complementing the work done by
Lam et al. [38]. Second, we give a historical overview of the use of
evaluation in the visualization community as reported in the IEEE
Information Visualization and IEEE Visualization conferences and put
evaluation practices into perspective. This is a qualitative assessment
and provides a historical perspective by comparing current and past
evaluation practices. And, third, we provide information for researchers
conducting evaluation by assisting them to identify, justify, and refine
evaluation approaches as well as helping them to recognize and avoid
pitfalls that can be learned from previous research.
2 FUNDAMENTALS AND RELATED WORK
There are two traditions of evaluation that the visualization community
draws from—evaluation in the sciences (both social and natural) and
evaluation in design. On the one hand, scientists try to understand the
world and seek a representative model, often a mathematical model
(e. g., Newton’s law or Fitts’ law), while designers and engineers intro-
duce a tool and henceforth seek to alter the world in which they live and

with which they interact. Science is concerned with model validation
and reproducibility. In addition, in the computational sciences, the
mathematical model is turned into a computer algorithm. This invokes
challenges of verifying the algorithm based on the mathematical model.
For designers, the focus is what is called a ‘user’, putting the empha-
sis on the ‘human-in-the-loop’. Hence, aspects of tool functionality,
usability, and aesthetics are of concern.
2.1 Validation, Verification, and Reproducibility
In computational science, validation refers to the process of ensuring
the correctness of a conceptual or mathematical model with respect to
the salient aspects of reality [4]. In contrast, verification refers to the process
of determining the accuracy of an algorithmic implementation with
respect to the mathematical model. It is important to point out the
dilemma in science that theories and models cannot be validated, only
invalidated. Hence, the process of validation and verification tends to
be a difficult one and of empirical nature. It is thus common to test
one’s algorithms and models on a number of well-chosen test cases. In
the words of Karl Popper, one of the prominent philosophers of science:
“So long as a theory withstands severe tests and is not superseded by
another theory in the course of scientific progress, we may say that it
has ‘proved its mettle’” [54].
Simply computing an (absolute or relative) error measure between
a known, highly accurate solution and a current algorithmic output
tends to be a standard in code verification and is common practice
in visualization research. However, Etiene et al. [17] recently have
pointed out that asymptotic error measures can be more powerful in
finding problems in an implementation.
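To make the distinction concrete, the following minimal sketch (not taken from any specific paper) illustrates the kind of check such verification entails: a relative error against a trusted reference plus an observed order of convergence across refined grids. The error values and the isosurface scenario in the comments are hypothetical.

```python
import numpy as np

def relative_error(approx, reference):
    """L2 relative error between an algorithmic output and a trusted reference."""
    return np.linalg.norm(approx - reference) / np.linalg.norm(reference)

def observed_order(errors, spacings):
    """Estimate the observed order of convergence p from errors measured on
    successively refined grids: e ~ C * h^p  =>  p = log(e1/e2) / log(h1/h2)."""
    orders = []
    for (e1, h1), (e2, h2) in zip(zip(errors, spacings), zip(errors[1:], spacings[1:])):
        orders.append(np.log(e1 / e2) / np.log(h1 / h2))
    return orders

# Hypothetical measurements: errors of an isosurface extractor against an
# analytic scalar field, sampled on grids with spacing h, h/2, h/4.
spacings = [1.0, 0.5, 0.25]
errors = [8.0e-2, 2.1e-2, 5.4e-3]          # made-up numbers for illustration
print(observed_order(errors, spacings))    # values near 2 suggest 2nd-order accuracy
```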
Reproducibility of experiments is essential during the validation
process. Experiments are often the basis for our conceptual models
of the real world, but come with imprecision attached. Being able to
quantify this error and reproduce the measurements greatly increases
the confidence in the model and theories—which is important in the
social and natural sciences alike. Based on this notion a movement in
the computational sciences has been established known as reproducible
research [15]. It advocates the publication of data and source code
together with the paper in order to improve independent validation of
the proposed models, thereby increasing the trust in them and
accelerating scientific progress. To address these issues in our community
(see also, e. g., [16, 21, 31, 67]), recently the EuroRVVV workshop¹
has been created to discuss, in particular, problems of reproducibility,
verification, and validation in visualization research.
2.2 Human-In-The-Loop
The Human-Computer Interaction (HCI) community has focused on
understanding the human-centered-design process specific to computa-
tional tools. Hence, while the functionality (effectiveness) of a tool is
of primary concern, usability (efficiency) and aesthetics (affect) play an
important role as well [61, 63]. Just like for the scientific method, there
is no possibility to fool-proof a tool. Hence, in both cases—science
and engineering—evaluation is based on empirical methods.
While the general practices in HCI are also applicable to our field,
visualization research has several unique properties that have led re-
searchers to reflect on how to best study and evaluate visualization
tools. Therefore, with BELIV² a dedicated workshop series has been
established on this topic, similar to EuroRVVV.
Carpendale [9] provides an excellent overview of different empirical
evaluation approaches and strategies as they can be applied to visualiza-
tion research. In particular, she describes quantitative and qualitative
evaluation methods, highlighting the advantages and challenges for
both. Important differences between a quantitative, controlled approach
and a more qualitative study technique that aims at measuring insight
using open-ended protocols have also been argued well by North [50].
While not specifically focused on visualization, a further important
paper on finding the right study approach is Greenberg and Buxton’s
“usability evaluation considered harmful (some of the time)” [23], in
which they discuss pitfalls of over-focusing on usability studies.

¹ EuroVis Workshop on Reproducibility, Verification, and Validation in Visualization; see http://www.eurorvvv.org/.
² Beyond Time and Errors: Novel Evaluation Methods for (Information) Visualization; see http://www.beliv.org/.
Munzner’s Nested Model [47] identifies four levels of visualization
design—problem characterization, data and task abstraction, visual
encoding/interaction design, and algorithm design—and provides guid-
ance on valid evaluation methods for these different design levels. Tory
and Möller [65] specifically discuss the role of human factors in visual-
ization and advocate various methods applied in user-centered design
processes. Based on such methods, Sedlmair et al. [59] provide a de-
sign study methodology that guides the selection of such evaluation
methods in problem-driven and collaborative visualization projects.
2.3 Novel Evaluation Methods
While visualization research borrows heavily from other disciplines,
several researchers have either developed new methods or discussed
in detail how certain approaches need to be extended for visualization
evaluation. In particular, empirical evaluation and the consideration of
human factors are actively being discussed [3, 9, 10, 11, 19, 36, 46, 53,
65, 68]. Here we highlight some select methods discussed in the past.
North’s insight-based method mentioned above [50] is one promi-
nent example of a novel method for visualization evaluation. Others
have reflected on the use of a critical inspection as a form of evaluating
visualizations. This inspection can either be done by the authors of
an article themselves or by reporting feedback from external experts.
This “critical thinking about visualization” [34] has to be neutral and be
backed up by facts, as Kosara points out. In addition, Tory and Möller [66]
argue that domain expert feedback can be a viable complement
to controlled studies, both for heuristic evaluation of usability as well
as for understanding the support of high-level cognitive tasks. How-
ever, not only the judgment of domain experts but also that of visual
experts such as artists, graphic designers, or illustrators can be useful
as has been shown in a few cases [1, 27, 29, 30]—in particular since
‘critique’ as a technique originated from teaching in the visual arts [35].
In visualization, it can be used in combination with techniques such as
sketching and ideation when developing new visualization techniques
[26]. For evaluating the impact of visualization tools on practices of
real users, specific forms of case studies have been suggested [62].
Even gameplay as a form of human computing can be used to involve a
wide audience in evaluating visualizations [2]. Van Wijk [69, 70] em-
ployed an economic model to assess the “value of visualization” based
on effectiveness and efficiency. He uses this model to explain the success,
or lack thereof, of a number of example visualization techniques and tools.
Van Wijk’s model, however, is based on the correct estimation of costs
and benefits for a tool, which is often difficult to obtain in practice.
2.4 The Practice of Evaluation in Visualization Research
While the cited researchers have reflected on methodological ap-
proaches for evaluation, others have looked systematically at evaluation
in visualization (e. g., [11, 52, 65]). Most influential for our work is the
meta-study on evaluation goals by Lam et al. [38]. They examined 850
papers published at the IEEE Information Visualization symposium/
conference (InfoVis), in Palgrave’s Journal of Information Visualization
(IV), at the IEEE Symposium on Visual Analytics in Science and Tech-
nology (VAST), as well as at the Eurographics Symposium/Conference
on Visualization (EuroVis)³ in the years 1995–2010. Based on their
analysis, Lam et al. identified seven scenarios that delineate empirical
evaluation goals and visualization questions prevalent in the examined
visualization publications. They found that, in the chosen sample of
papers, the use of evaluation in visualization papers is steadily increas-
ing but that the types of evaluation most frequently used are those
that examine people’s performances, user experience, and objective
measures of algorithm quality and performance. However, because
Lam et al.’s systematic review of the use of evaluation in visualization
is restricted to what they consider to be ‘information’ visualization
work, it only provides a part of the whole picture. In our own work we
use Lam et al.’s set of seven scenarios to analyze the use of evaluation
in the research published at the IEEE Visualization conference and
compare our results to those by Lam et al. Their rigorous methodological
coding approach and the resulting descriptive scenarios give us a
straightforward way to build and extend upon their work. Furthermore,
it allows us to compare different practices and trends in the visualization
sub-communities of ‘scientific’ and ‘information’ visualization, and to
draw conclusions based upon these findings.

³ Of the papers published at the EuroVis conferences, (following their reviewers’ request) Lam et al. [38] excluded those papers they (Lam et al.) classified as “pure SciVis papers [. . . ] based on visualization type: (e. g., pure volume, molecular, fibre-bundle, or flow visualization)”.
3 APPROACH AND METHODOLOGY
In order to get a systematic overview of the state of evaluation in visual-
ization we conducted a rigorous qualitative literature review. Qualitative
literature reviews are a standard technique in many areas of science
to objectively report on current knowledge or practices on a topic of
interest [22]. As a comprehensive overview, a literature review can
help to place a topic or practices into perspective. We approached our
literature review as discussed in the following sections.
3.1 Choice of Literature
To get a comprehensive overview of the use of evaluation in visual-
ization as a whole we assessed the respective practices in the IEEE
Visualization conference (now IEEE Scientific Visualization) and com-
pared them later to the previous assessment by Lam et al. [38] of the IEEE
Information Visualization conference. This approach allowed us to
inspect a good cross-section of topics, approaches, and solutions com-
mon to the visualization community. Out of the past 23 years of the
IEEE Visualization conference, we chose to code the past seven years
(2012–2006) as well as 2003, 2000, and 1997. Coding the past seven
years allowed us to reflect on current practices, while coding the earlier
years allowed us to put results into historical perspective.
3.2 Choice of Codes
We based our coding scheme on the seven scenarios presented by Lam
et al. [38]. Each scenario was assigned as a code. In the process of
coding, we extended the initial list by one code (QRI). We also decided
to rename Lam et al.’s VA (Visualization Algorithms) code to AP
(Algorithm Performance) to more accurately reflect our findings. Based
on these changes, we used the following list of codes (a brief illustrative
sketch of the code set follows the list):
UWP Understanding Environments and Work Practices: This code
includes evaluations that derive an understanding of the work, anal-
ysis, or information processing practices by a given group of people
with or without software use. Common examples are evaluations
with experts to understand their data analysis needs and require-
ments for developing a visualization.
VDAR Visual Data Analysis and Reasoning: This code includes
evaluations that assess how a visualization tool supports analysis
and reasoning about data and helps to derive relevant knowledge
in a given domain. Example evaluations include those that study
experts using a tool on their data and analyzing how they can solve
domain-specific questions with a new tool.
CTV Evaluating Communication Through Visualization: This code
includes evaluations that assess the communicative value of a visu-
alization or visual representation in regards to goals such as teach-
ing/learning, idea presentation, or casual use. For example, a study
that assesses how well a visualization can communicate medical
information to a patient would fall into this category.
CDA Evaluating Collaborative Data Analysis: Evaluations in this
group try to understand to what extent a visualization tool supports
collaborative data analysis by groups of people.
While the previous scenarios focused on the process of data analysis,
the remainder focuses on understanding visualizations or visualization
systems and algorithms:
UP User Performance: Evaluations in this category objectively mea-
sure how specific features affect the performance of people with
a system. Controlled experiments using time and error are typical
example methods in this category.
UE User Experience: This code includes evaluations that elicit sub-
jective feedback and opinions on a visualization (tool). Interviews
and Likert-scale questionnaires are common methods to do so.
AP Algorithm Performance: Evaluations in this category quantita-
tively study the performance or quality of visualization algorithms.
The most common examples include measurements of rendering
speed or memory performance. This scenario was originally called
VA (Visualization Algorithms) in Lam et al.’s [38] paper.
QRI Qualitative Result Inspection: Evaluations in this category are
evaluations through qualitative discussions and assessments of visu-
alization results. In contrast to UE, they do not involve actual end
users or participants but instead ask the viewer of a resulting image
to make an assessment for themselves. The following section gives
details on why we chose to add this code.
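As a purely illustrative aside (not part of the original coding materials), the code set and the fact that one paper can carry several codes could be represented as follows; the example paper and its codes are hypothetical.

```python
from enum import Enum
from dataclasses import dataclass, field

class Scenario(Enum):
    """Evaluation scenario codes used in this review (extending Lam et al.)."""
    UWP = "Understanding Environments and Work Practices"
    VDAR = "Visual Data Analysis and Reasoning"
    CTV = "Evaluating Communication Through Visualization"
    CDA = "Evaluating Collaborative Data Analysis"
    UP = "User Performance"
    UE = "User Experience"
    AP = "Algorithm Performance"
    QRI = "Qualitative Result Inspection"

@dataclass
class CodedPaper:
    """One paper can carry several scenario codes (many papers report multiple evaluations)."""
    title: str
    year: int
    codes: set[Scenario] = field(default_factory=set)

# Hypothetical example: a volume-rendering paper evaluated by timing and result inspection.
paper = CodedPaper("A hypothetical volume renderer", 2009, {Scenario.AP, Scenario.QRI})
print(sorted(c.name for c in paper.codes))   # ['AP', 'QRI']
```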
3.3 Coding Method
Five coders (all co-authors of this paper) participated in the assessment
of the literature. To calibrate, we started coding with Lam et al.’s [38]
seven unique scenarios. We randomly picked the year 2009 to calibrate
seven unique scenarios. We randomly picked the year 2009 to calibrate
our codes. Each of the 54 papers in this year was assigned to a varying
set of three coders. After the first coding pass the inter-coder reliability
had reached 0.743 (Krippendorff’s alpha [37, Ch. 12]). For each paper
with a conflicted code, the coders discussed reasons for discrepancies
and resolved them. After the first coding pass, all coders also met to
discuss whether the initial code set needed to be extended. It was at this
point that we decided to introduce the QRI code. We noted that a large
number of papers in 2009 used QRI as an evaluation or proof for the
quality of their results. A discussion among the authors of this paper
ensued as to the validity of the code as an actual type of ‘evaluation’
and we will further reflect on the issue in Sect. 4. Yet, given that the
coding of 2009 revealed an apparent prevalence of papers with QRI and
because of past discussions of the approach in the literature [34, 47, 69],
we included this code to quantitatively assess its actual prevalence and
to be able to qualitatively discuss QRI practices.
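For readers unfamiliar with the measure, the following is a minimal sketch (not the tooling we actually used) of how Krippendorff's alpha can be computed for nominal codes with a variable number of coders per paper; the four example units are invented.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's alpha for nominal data.
    `ratings` is a list of units; each unit lists the codes assigned by the
    coders who rated it (units with fewer than 2 ratings are ignored)."""
    coincidences = Counter()                    # o_ck: coincidence matrix
    for unit in ratings:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):    # ordered pairs of ratings within one unit
            coincidences[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()                             # marginal totals per code
    for (c, _), w in coincidences.items():
        n_c[c] += w
    n = sum(n_c.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 - observed / expected

# Invented example: 4 papers, each coded by up to 3 coders.
units = [["AP", "AP", "AP"], ["QRI", "QRI", "AP"], ["UE", "UE", None], ["AP", "QRI", "QRI"]]
print(round(krippendorff_alpha_nominal(units), 3))   # 0.474 for this toy data
```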
After the first conflicts had all been resolved, the remaining papers
were assigned to one coder each. In the process of coding the remaining
papers all coders remained in close contact and communicated when
choices were made as to which evaluations to include. For example,
one point of discussion pertained to the amount of rigor required for an
evaluation to be included. While some articles, for example, included
multi-dataset comparisons of rendering speeds, some papers just re-
ported them for a single example or in a rough manner (“less than 1
second”). We decided not to exclude evaluations based on rigor but
chose to exclude cases that just reported anecdotal evidence. Papers
that were unclear were marked and re-coded by a second coder.
4 RESULTS AND DISCUSSION
Collaboratively coding this broad set of papers led to many intensive
discussions among the authors regarding current evaluation practices
in our field, the meaning of rigorous and convincing evaluation with
respect to what we observed in the papers, as well as the range of differ-
ent evaluation approaches that are covered within the categories. In this
section, we summarize both quantitative and qualitative observations
from our literature analysis and our discussions about them.
In total we coded 581 papers from the IEEE Visualization conference
as discussed in Sect. 3.⁴ Out of these, 569 (97%) included at least one
type of evaluation from our code set and 441 (76%) at least one out
of the original seven scenarios [38]. In total, we coded 1002 scenarios
spread across the 569 papers, meaning that many papers included
several evaluations with differing goals. Fig. 1(a) shows a histogram of
the total number of papers coded per scenario while Fig. 1(b) gives a
historic overview of the spread of evaluation scenarios coded, in percent
of all papers in a given year. Next, we discuss more detailed findings.
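As an illustration of how such per-scenario counts and per-year percentages (cf. Fig. 1) can be tallied from the coded data, consider the following minimal sketch; the records shown are hypothetical stand-ins for rows of the shared spreadsheet.

```python
from collections import Counter, defaultdict

# Hypothetical coded records: (year, set of scenario codes assigned to one paper).
coded_papers = [
    (2009, {"AP", "QRI"}),
    (2009, {"UE", "UP", "QRI"}),
    (2012, {"VDAR"}),
    # ... one entry per coded paper
]

papers_per_year = Counter(year for year, _ in coded_papers)
scenario_counts = defaultdict(Counter)        # year -> scenario code -> number of papers
for year, codes in coded_papers:
    for code in codes:
        scenario_counts[year][code] += 1

for year in sorted(scenario_counts):
    total = papers_per_year[year]
    shares = {c: f"{100 * k / total:.0f}%" for c, k in sorted(scenario_counts[year].items())}
    print(year, shares)
```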
4.1 Evaluation Scenarios
The original seven scenarios were grouped by Lam et al. into two main
categories: understanding visualizations (UP, UE, AP) and understand-
ing data analysis processes (UWP, VDAR, CDA, CTV). Our new code
QRI falls into the visualization category, leading to 955 of the 1002
(95%) evaluation scenarios we found in the coded papers being of
this category. We found only 47 scenarios (4.7%) that studied processes
of data analysis. We found no instance of a study that assessed the
communicative value of a visualization (CTV).

⁴ The data can be found in a Google Spreadsheet at http://goo.gl/CGswy.

[Fig. 1. Evaluation scenarios for IEEE Visualization conference papers. (a) Total number of papers per scenario (QRI, AP, UE, UP, UWP, VDAR, CDA, CTV; years: 1997, 2000, 2003, 2006–2012). (b) Scenarios in percent of papers per year coded, grouped into visualization- and process-oriented scenarios.]
The most common visualization scenario was QRI (46% of all sce-
narios) followed by AP (35% of all scenarios). Both scenarios together
covered 81% of all the scenarios we coded. Evaluations in both QRI
and AP were always conducted (by definition) without actual study
participants. Interestingly, these two codes differ sharply from the other
two in this category that involve studying participants: UE (9.3% of all
scenarios) and UP (4.6% of all scenarios).
The following subsections discuss our most interesting findings and
observations in regards to specific scenarios in more detail.
4.1.1 AP: The Importance of FPS and Memory Footprint
The reporting of performance of a (novel) algorithm, technique, or tool
was, with 35% of all coded scenarios, the second most frequent type
of evaluation we observed. Typically, authors reported computation
times for processing or rendering speeds in frames per second, for a
number of example datasets and a given platform. In the earlier papers,
we also frequently observed the reporting of memory footprints, albeit
less often in more recent papers—probably since memory shortage
seems to be less of a concern these days. Such performance metrics are
instructive because they inform the reader about the dataset sizes an
implementation or technique is applicable for, on a given platform.
Other objective metrics to evaluate a technique or implementation are
those that quantitatively assess the quality of a visualization algorithm.
Their goal is, therefore, to measure what a user can see or observe
without the need to study participants. In graph drawing, for instance,
quality measures such as the number of edge crossings are used as
criteria to assess the readability. For the visualization literature we
analyzed, this subset of AP evaluations included, for example, the
reporting of compression rates for shape compression techniques or
error metrics for the generated visuals. Such objective metrics to assess
a technique or its produced results were typically provided when some
kind of ground truth or other established quality standards (visual and
otherwise) existed against which results could be compared.
While frequently used as an evaluation approach, we saw a wide
range of reporting rigor in AP scenarios. In fact, it was when coding
this evaluation category that a discussion ensued among the authors at
which point the report of AP results should be called an ‘evaluation’. In
particular, for time and memory performance we saw papers that simply
reported a single frames-per-second number for a single, specifically
selected dataset. Some papers just reported a range of rendering speeds
without further specification of the dataset used or on which platform
they were produced. A popular but notoriously imprecise assessment
for rendering speeds was the term “interactive framerates” which can
mean anything from 1 fps to 120 fps or more. We decided not to code
such ‘performance evaluations’ as AP if they were limited to a single
measurement without a clear platform or dataset. In contrast, typical
evaluations gave rendering speeds for a number of different example
datasets and reported the platform used. Good evaluations analyzed
the behavior for a range of dataset sizes or used a number of additional
metrics to assess different concepts of visual quality of the results. A
good performance analysis was presented, for instance, by Lindstrom
and Isenburg [39]; a nice example of objectively analyzing the quality
of a proposed visualization result is Schultz and Seidel’s work [57].
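The following minimal sketch (not from any of the coded papers) illustrates the kind of reporting that distinguishes the better AP evaluations: timing over several dataset sizes, aggregated over many frames, together with the platform; render_frame is a placeholder for the algorithm under test, and the dataset sizes are arbitrary.

```python
import platform
import statistics
import time

def render_frame(num_cells):
    """Stand-in for the visualization algorithm under test."""
    return sum(i * i for i in range(num_cells))   # placeholder workload

def benchmark(dataset_sizes, frames=20):
    """Report fps and frame times across dataset sizes on a documented platform."""
    print("Platform:", platform.processor() or platform.machine(), "| Python", platform.python_version())
    for n in dataset_sizes:
        times = []
        for _ in range(frames):
            t0 = time.perf_counter()
            render_frame(n)
            times.append(time.perf_counter() - t0)
        fps = 1.0 / statistics.mean(times)
        print(f"{n:>10} cells: {fps:8.1f} fps  (median frame {1e3 * statistics.median(times):.2f} ms)")

# Report behavior across a range of dataset sizes rather than a single cherry-picked number.
benchmark([10_000, 100_000, 1_000_000])
```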
4.1.2 QRI: Qualitative Result Inspection
As discussed previously in Sect. 3, we added one category to the
seven categories defined by Lam et al.: the prevalent ‘qualitative results
inspection’ category (QRI). We included this code despite the fact that it
is not an evaluation in the traditional sense. Put simply, a QRI addresses
the reader of a paper and encourages him/her to agree on a quality
statement by inspecting a result image. An example statement could be
that “the figure shows that our tool can clearly depict structure x in the
data which was impossible with previous approaches.” While this is
just one example, we encountered a variety of different approaches and
factors of what we considered as QRI. Although we decided not to further
split this category during the coding, we next discuss these details for a
better understanding of the breadth and rigor we saw in this scenario.
In essence, we found three important types:
1. Image Quality: The classical form of QRI we found was the qual-
itative discussion of images produced by a (rendering) algorithm.
A new algorithm was often targeted at producing images of a
certain quality and it was common to show and assess visually
that quality goals had been met (e. g., [48]).
2. Visual Encoding: Introducing a new visual encoding (e. g., a
novel transfer function or novel glyphs for vector and tensor
fields) was also quite common. An example was the introduction
of superquadric glyphs for second-order tensors by Schultz and
Kindlmann [56]. A QRI would in this case highlight what these
new encodings could show and how.
3. Walkthrough: We intentionally, however, did not limit the scope
of this category to visual encoding or image quality discussions,
as we also found instances of qualitative discussions of system
behavior (e. g., [72]) and interaction concepts (e. g., [8]). These
discussions convincingly validated the proposed contributions.
We found two major approaches in how QRI were conducted: com-
parative and isolated. Comparative result inspections had clear state-
of-the-art competitors that provided different solutions for the same
problem. The goal was to improve upon these current state-of-the-art
solutions. A typical approach was to compute output images with
different algorithms, including the newly proposed one, for a range of
different datasets. These images were then compared side-by-side and
the authors walked the reader through them to explain the differences
and benefits of the newly proposed algorithm (e. g., [48]).
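As an illustration only (not from any of the coded papers), a comparative grid of this kind could be assembled as follows; load_result, the algorithm labels, and the dataset names are placeholders for the images a comparative QRI would discuss side by side.

```python
import numpy as np
import matplotlib.pyplot as plt

def load_result(algorithm, dataset):
    """Placeholder: return the image produced by `algorithm` on `dataset`."""
    rng = np.random.default_rng(hash((algorithm, dataset)) % 2**32)
    return rng.random((64, 64))

algorithms = ["previous method", "our method"]      # hypothetical labels
datasets = ["aneurysm", "engine", "tornado"]        # hypothetical datasets

fig, axes = plt.subplots(len(algorithms), len(datasets), figsize=(9, 6))
for row, algo in enumerate(algorithms):
    for col, data in enumerate(datasets):
        ax = axes[row, col]
        ax.imshow(load_result(algo, data), cmap="gray")
        ax.set_xticks([])
        ax.set_yticks([])
        if row == 0:
            ax.set_title(data)          # dataset name above each column
        if col == 0:
            ax.set_ylabel(algo)         # algorithm name next to each row
fig.tight_layout()
fig.savefig("comparison_grid.png", dpi=150)
```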
Another approach was to qualitatively inspect the results in isolation;
i. e., there was no clear competitor that addressed the same problem
which could be used for comparison. In those cases, a solid description
of the problem at hand as well as the justification of how the proposed
new algorithm/technique/system addressed it was mandatory. Not
doing so resulted in a pure description much like a manual that failed
the purpose of evaluation—we did not code these descriptions.

References (selected)

Krippendorff, K. Content Analysis: An Introduction to Its Methodology.
Merriam, S. B. Qualitative Research: A Guide to Design and Implementation.
Popper, K. The Logic of Scientific Discovery.
Shneiderman, B., et al. Designing the User Interface: Strategies for Effective Human-Computer Interaction.