Citation-based clustering of publications using CitNetExplorer and VOSviewer

doi:10.1007/S11192-017-2300-7


Nees Jan van Eck¹ · Ludo Waltman¹

Received: 6 June 2016 / Published online: 27 February 2017

© The Author(s) 2017. This article is published with open access at Springerlink.com

Abstract Clustering scientific publications is an important problem in bibliometric

research. We demonstrate how two software tools, CitNetExplorer and VOSviewer, can be

used to cluster publications and to analyze the resulting clustering solutions. CitNetEx-

plorer is used to cluster a large set of publications in the ﬁeld of astronomy and astro-

physics. The publications are clustered based on direct citation relations. CitNetExplorer

and VOSviewer are used together to analyze the resulting clustering solutions. Both tools

use visualizations to support the analysis of the clustering solutions, with CitNetExplorer

focusing on the analysis at the level of individual publications and VOSviewer focusing on

the analysis at an aggregate level. The demonstration provided in this paper shows how a

clustering of publications can be created and analyzed using freely available software

tools. Using the approach presented in this paper, bibliometricians are able to carry out

sophisticated cluster analyses without the need to have a deep knowledge of clustering

techniques and without requiring advanced computer skills.

Keywords Citation · Clustering · CitNetExplorer · VOSviewer

Introduction

Clustering techniques play a prominent role in bibliometric research. They are for instance

used to identify groups of related publications, authors, or journals. Clustering techniques

have been developed mainly in ﬁelds such as statistics, computer science, and network

science. Bibliometricians usually do not develop their own clustering techniques, but they
use existing clustering techniques developed in other fields. They apply these techniques to
bibliometric data sets, sometimes after adapting the techniques to the specific
characteristics of bibliometric data.

Nees Jan van Eck: ecknjpvan@cwts.leidenuniv.nl
Ludo Waltman: waltmanlr@cwts.leidenuniv.nl
¹ Centre for Science and Technology Studies, Leiden University, Leiden, The Netherlands
Scientometrics (2017) 111:1053–1070

When the number of objects to be clustered is relatively limited (e.g., at most a few

hundred objects), analyzing and interpreting the results obtained from a clustering tech-

nique usually does not cause any signiﬁcant difﬁculties. However, when dealing with large

numbers of objects, analyzing and interpreting a clustering solution is far from straight-

forward. This can be a problem especially when clustering techniques are applied at the

level of individual publications. We may then have clustering solutions that include many

thousands or even many millions of publications (e.g., Boyack and Klavans 2014; Klavans
and Boyack 2017; Waltman and Van Eck 2012). Making sense of these clustering solutions

can be a serious challenge.

In this paper, our aim is to demonstrate how two software tools that we have developed,

CitNetExplorer (Van Eck and Waltman 2014a, b; www.citnetexplorer.nl) and VOSviewer
(Van Eck and Waltman 2010, 2014b; www.vosviewer.com), can be used to cluster publications
and to analyze the resulting clustering solutions. We use CitNetExplorer to cluster

publications based on their citation relations and to analyze the resulting clustering solu-

tions at the level of individual publications. We use VOSviewer to analyze the clustering

solutions obtained using CitNetExplorer at an aggregate level. CitNetExplorer and VOS-

viewer both rely strongly on visualizations to facilitate the analysis of clustering solutions.

CitNetExplorer, which is an abbreviation of ‘citation network explorer’, is a software
tool that we have developed for analyzing and visualizing citation networks. In the

approach that we take in this paper, we ﬁrst use CitNetExplorer to cluster publications

based on their citation relations. For this purpose, CitNetExplorer employs a clustering

technique that we have introduced in earlier papers (Waltman and Van Eck 2012, 2013).

We then use CitNetExplorer to analyze the resulting clustering solution at the level of

individual publications. To facilitate the analysis of a clustering solution, the following

features of CitNetExplorer are essential:

• Visualizing a citation network. CitNetExplorer can be used to visualize a citation

network of publications, with publications shown along a time axis and with colors

indicating the clusters to which publications belong. Using the visualization

functionality of CitNetExplorer, we obtain an overview of the most frequently cited

publications in a citation network, the citation relations between these publications, and

the clusters to which the publications belong.

• Drilling down into a citation network. The drill down functionality of CitNetExplorer

can be used to analyze a clustering solution at different levels of detail. We may for

instance start with a visualization at the level of the entire citation network. We may

then perform a drill down into one or more selected clusters, after which we are

provided with a visualization at the level of the subnetwork consisting of the

publications belonging to the selected clusters.

• Searching for publications. We can search for publications based on title, publication

year, author name, and journal name. The search functionality of CitNetExplorer can

be used to ﬁnd publications that are of special interest, for instance all publications in a

speciﬁc journal, and to ﬁnd out to which clusters these publications belong.

VOSviewer is a software tool for constructing and visualizing bibliometric networks. In

this paper, VOSviewer is used to complement CitNetExplorer. While we use CitNetEx-

plorer to analyze a clustering solution at the level of individual publications, we use

VOSviewer to analyze a clustering solution at an aggregate level. Two visualizations

provided by VOSviewer play an important role. The ﬁrst visualization shows the clusters in

a clustering solution and the citation relations between these clusters. The second visu-

alization uses a so-called term map to indicate the topics that are covered by a cluster. This

visualization shows the most important terms occurring in the publications belonging to a

cluster and the co-occurrence relations between these terms.
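As a rough illustration of what an analysis at an aggregate level involves, the sketch below counts the citations that run between clusters, given publication-level citations and a cluster assignment. It is not a description of how VOSviewer computes its maps; the function name `cluster_citation_counts` and the toy data are hypothetical.

```python
from collections import Counter

def cluster_citation_counts(citations, cluster_of):
    """Count citations between distinct clusters, ignoring direction.

    citations  -- iterable of (citing, cited) publication pairs
    cluster_of -- dict mapping each publication to its cluster label
    """
    counts = Counter()
    for citing, cited in citations:
        a, b = cluster_of[citing], cluster_of[cited]
        if a != b:
            # Store each cluster pair in a canonical (sorted) order.
            counts[tuple(sorted((a, b)))] += 1
    return counts

# Toy data (hypothetical): five publications in three clusters.
citations = [("p2", "p1"), ("p3", "p1"), ("p4", "p3"), ("p5", "p3"), ("p5", "p4")]
cluster_of = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
between = cluster_citation_counts(citations, cluster_of)
```

Citations within a cluster are skipped here because only the relations between clusters are of interest at this level.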

This paper is organized as follows. The “Clustering technique” section discusses the
clustering technique that is used by CitNetExplorer to cluster publications based on their
citation relations. The “Results” section demonstrates the use of CitNetExplorer and
VOSviewer to cluster publications and to analyze the resulting clustering solutions.
CitNetExplorer is used to cluster more than 100,000 publications in the field of astronomy
and astrophysics, and CitNetExplorer and VOSviewer are used together to analyze the
resulting clustering solutions. The “Conclusion” section concludes the paper.

Clustering technique

In this paper, we use the clustering technique that is available in the CitNetExplorer
software tool. This section provides a discussion of this clustering technique. The
“Determining the relatedness of publications” section explains how the relatedness of
publications is determined, and the “Clustering publications” section describes how
publications are assigned to clusters. We refer to Waltman and Van Eck (2012, 2013) for a
more extensive discussion of our clustering technique.

Determining the relatedness of publications

To cluster publications, we first need to determine the relatedness of publications. In the
bibliometric literature, the most commonly used approaches to determine the relatedness of
publications are based on either citation relations or word relations (for a more extensive
discussion, see Van Eck and Waltman 2014b). In the case of citation relations, a further
distinction can be made between direct citation relations, bibliographic coupling relations,
and co-citation relations (e.g., Boyack and Klavans 2010; Klavans and Boyack 2017). In
the case of word relations, shared words in the titles, abstracts, or full texts of publications
serve as an indication of the relatedness of publications (e.g., Boyack et al. 2011; Janssens
et al. 2006). Sometimes the relatedness of publications is determined using a combined
approach that takes into account both citation relations and word relations (e.g., Boyack
and Klavans 2010; Janssens et al. 2008).

Our clustering technique determines the relatedness of publications based on direct

citation relations. We prefer to use citation relations rather than word relations because the

use of word relations involves some difﬁculties. Some words have a different meaning in

different ﬁelds of science. These words may incorrectly indicate that publications from

different ﬁelds are related to each other. Also, some words are very general and are used in

many different ﬁelds. These words do not provide useful information on the relatedness of

publications.

We prefer to use direct citation relations rather than bibliographic coupling relations

(i.e., relations between publications that cite the same publication) or co-citation relations

(i.e., relations between publications that are cited by the same publication) for two reasons.

First, bibliographic coupling and co-citation relations are indirect relations, and we

therefore expect them to provide less accurate information on the relatedness of

publications than direct citation relations (Waltman and Van Eck 2012). Second, there are

many more bibliographic coupling or co-citation relations between publications than direct

citation relations, and therefore the use of bibliographic coupling or co-citation relations

may easily lead to computational problems. (This also applies to the use of word relations.)

Although we prefer the use of direct citation relations over the use of bibliographic cou-

pling or co-citation relations, we acknowledge that the use of direct citation relations also

has a disadvantage. Within the period of analysis, some publications may have no direct

citation relations with other publications. When using direct citation relations, these

publications cannot be properly assigned to a cluster. This problem is especially serious

when the period of analysis is relatively short. When using bibliographic coupling relations

rather than direct citation relations, one usually does not have this problem. We note that,

in addition to our own work, the use of direct citation relations is also advocated in recent
work by Klavans and Boyack (2017).
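The three types of citation relations distinguished above can all be derived from the same list of citations. The following minimal sketch does this for a small, made-up citation list; the variable names and the data are hypothetical and serve only to make the definitions concrete.

```python
from collections import defaultdict
from itertools import combinations

# Toy citation list (hypothetical): (citing, cited) pairs.
citations = [("C", "A"), ("C", "B"), ("D", "A"), ("D", "B"), ("D", "C")]

def unordered(p, q):
    """Return a publication pair in a canonical (sorted) order."""
    return (p, q) if p < q else (q, p)

# Direct citation: one publication cites the other.
direct = {unordered(citing, cited) for citing, cited in citations}

references = defaultdict(set)  # publication -> publications it cites
cited_by = defaultdict(set)    # publication -> publications citing it
for citing, cited in citations:
    references[citing].add(cited)
    cited_by[cited].add(citing)

# Bibliographic coupling: two publications cite the same publication.
coupling = {
    unordered(p, q)
    for citers in cited_by.values()
    for p, q in combinations(sorted(citers), 2)
}

# Co-citation: two publications are cited by the same publication.
cocitation = {
    unordered(p, q)
    for refs in references.values()
    for p, q in combinations(sorted(refs), 2)
}
```

The computational concern mentioned above follows from the pair enumeration: a publication with r references contributes r direct citation relations but up to r(r - 1)/2 co-citation pairs, and a publication cited m times contributes up to m(m - 1)/2 bibliographic coupling pairs.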

Clustering publications

After the relatedness of publications has been determined, our clustering technique assigns

publications to clusters. Each publication is assigned to exactly one cluster. Hence, there is

no overlap of clusters and there are no publications without a cluster assignment. It may be

argued that there should be room for publications to be assigned to more than one cluster.

However, allowing publications to be assigned to multiple clusters introduces signiﬁcant

technical challenges. For this reason, we prefer to assign publications to a single cluster

only. For most publications, we believe that it is reasonable to assign them to just one

cluster.

Publications are assigned to clusters by maximizing a quality function. The quality
function that is used has been introduced in an earlier paper (Waltman and Van Eck 2012).
This quality function is a variant of the well-known modularity function of Newman and
Girvan (2004) and Newman (2004) developed in the field of network science. The quality
function is very similar to the quality function resulting from the so-called constant Potts
model proposed by Traag et al. (2011). Our quality function has an important advantage
over the popular modularity function. The modularity function suffers from a problem
known as the resolution limit (Fortunato and Barthélemy 2007). This problem causes the
modularity function to yield counterintuitive results in certain situations. As shown by
Traag et al. (2011), our quality function does not suffer from the resolution limit problem.

More specifically, our clustering technique assigns publications to clusters by
maximizing the quality function

$$Q(x_1, \ldots, x_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} \delta(x_i, x_j) \left( a_{ij} - \frac{\gamma}{2n} \right), \qquad (1)$$

where $n$ denotes the number of publications, $a_{ij}$ denotes the relatedness of publication $i$
with publication $j$, $\gamma$ denotes a so-called resolution parameter, and $x_i$ denotes the cluster to
which publication $i$ is assigned. The function $\delta(x_i, x_j)$ equals 1 if $x_i = x_j$ and 0 otherwise.
The relatedness of publication $i$ with publication $j$ is given by

$$a_{ij} = \frac{c_{ij}}{\sum_{k=1}^{n} c_{ik}}, \qquad (2)$$

where $c_{ij}$ equals 1 if either publication $i$ cites publication $j$ or publication $j$ cites publication
$i$ and $c_{ij}$ equals 0 otherwise. Hence, if there is a direct citation relation between publications
$i$ and $j$, the relatedness of publication $i$ with publication $j$ is inversely proportional to the
total number of direct citation relations of publication $i$. If there is no direct citation
relation between publications $i$ and $j$, the relatedness of the publications equals 0. Notice
that our clustering technique ignores the direction of a citation (i.e., no distinction is made
between publication $i$ citing publication $j$ and publication $j$ citing publication $i$).
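The relatedness measure in (2) can be made concrete with a short sketch. The code below is a minimal illustration, not CitNetExplorer's implementation; the function name `relatedness` and the toy citation list are hypothetical, and skipping self-citations of a publication is an assumption.

```python
from collections import defaultdict

def relatedness(citations):
    """Compute a_ij = c_ij / sum_k c_ik from (citing, cited) pairs.

    The direction of a citation is ignored: c_ij = 1 if i cites j or j cites i.
    Returns a dict mapping (i, j) to a_ij for all pairs with c_ij = 1.
    """
    neighbours = defaultdict(set)
    for citing, cited in citations:
        if citing == cited:
            continue  # assumption: a publication citing itself is skipped
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)
    return {
        (i, j): 1.0 / len(linked)
        for i, linked in neighbours.items()
        for j in linked
    }

# Toy data (hypothetical): A is cited by B, C, and D; C also cites B.
citations = [("B", "A"), ("C", "A"), ("D", "A"), ("C", "B")]
a = relatedness(citations)
print(a[("A", "B")], a[("B", "A")])
```

In this example publication A has three direct citation relations and publication B has two, so the relatedness of A with B is 1/3 while the relatedness of B with A is 1/2. The measure is therefore not symmetric in general.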

The value of the resolution parameter $\gamma$ in (1) should be chosen based on the purpose of
the cluster analysis. Higher values of this parameter will yield a larger number of clusters.
In other words, the higher the value of $\gamma$, the higher the level of detail of the clustering
solution that will be obtained. In CitNetExplorer, the default value of $\gamma$ is 1. However, we
emphasize that there is no generally optimal value of $\gamma$. Our recommendation to users of
our clustering technique is to try out different values of $\gamma$ and to choose the value that
seems to give the most useful results for the specific needs of a user.
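The effect of the resolution parameter can be checked by evaluating the quality function in (1) for competing clustering solutions. The sketch below does this for a made-up network of six publications; `relatedness`, `quality`, and the data are hypothetical names and values chosen for illustration, not part of CitNetExplorer.

```python
from collections import Counter, defaultdict

def relatedness(edges):
    """a_ij = 1 / (number of citation relations of i) for each related pair (i, j)."""
    neighbours = defaultdict(set)
    for p, q in edges:
        neighbours[p].add(q)
        neighbours[q].add(p)
    return {(i, j): 1.0 / len(linked) for i, linked in neighbours.items() for j in linked}

def quality(a, cluster_of, gamma):
    """Evaluate Q = sum_i sum_j delta(x_i, x_j) * (a_ij - gamma / (2n)).

    a          -- dict mapping (i, j) to relatedness a_ij (missing pairs count as 0)
    cluster_of -- dict mapping each publication to its cluster label x_i
    gamma      -- resolution parameter
    """
    n = len(cluster_of)
    within = sum(v for (i, j), v in a.items() if cluster_of[i] == cluster_of[j])
    sizes = Counter(cluster_of.values())
    # The double sum runs over all ordered pairs, including i == j,
    # so a cluster with s publications contributes s * s penalty terms.
    penalty = gamma / (2 * n) * sum(s * s for s in sizes.values())
    return within - penalty

# Toy network (hypothetical): two triangles joined by one citation relation.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
a = relatedness(edges)
one_cluster = {v: 0 for v in range(1, 7)}
two_clusters = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

for gamma in (0.2, 1.0):
    print(gamma, quality(a, one_cluster, gamma), quality(a, two_clusters, gamma))
```

For these toy data the single-cluster solution has the higher quality at a resolution of 0.2, whereas the two-cluster solution has the higher quality at a resolution of 1, in line with the statement that higher values of the parameter favor a larger number of clusters.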

In order to maximize the quality function in (1), our clustering technique uses the smart
local moving algorithm introduced by Waltman and Van Eck (2013). This algorithm offers
a more sophisticated alternative to the popular Louvain algorithm for modularity
optimization (Blondel et al. 2008). When the smart local moving algorithm and the Louvain
algorithm are given a similar amount of computing time, the smart local moving algorithm
typically identifies a clustering solution with a significantly higher value for the quality
function. We refer to Waltman and Van Eck (2013) for an extensive comparison of the two
algorithms.
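To give an impression of how such optimization heuristics operate, the sketch below implements only a plain local moving step: publications are repeatedly moved, one at a time, to the cluster that increases the quality function in (1) the most. This is neither the smart local moving algorithm nor the Louvain algorithm, both of which add further steps; the function name `local_moving`, its parameters, and the toy data are hypothetical.

```python
from collections import defaultdict

def local_moving(edges, gamma=1.0, max_rounds=100):
    """Greedy local moving for Q = sum_ij delta(x_i, x_j) * (a_ij - gamma / (2n)).

    edges -- iterable of (citing, cited) pairs; the direction is ignored.
    Returns a dict mapping each publication to an integer cluster label.
    """
    neighbours = defaultdict(set)
    for p, q in edges:
        if p != q:
            neighbours[p].add(q)
            neighbours[q].add(p)
    nodes = sorted(neighbours)
    n = len(nodes)
    penalty = gamma / (2 * n)

    def weight(i, j):
        # w_ij = a_ij + a_ji, with a_ij = 1 / (number of citation relations of i)
        return 1.0 / len(neighbours[i]) + 1.0 / len(neighbours[j])

    cluster_of = {v: k for k, v in enumerate(nodes)}  # start from singletons
    size = defaultdict(int)
    for k in cluster_of.values():
        size[k] = 1
    next_label = n  # label of a cluster that is guaranteed to be empty

    for _ in range(max_rounds):
        moved = False
        for i in nodes:
            old = cluster_of[i]
            size[old] -= 1  # temporarily take i out of its cluster
            link = defaultdict(float)  # total weight between i and each cluster
            for j in neighbours[i]:
                link[cluster_of[j]] += weight(i, j)

            def gain(c):
                # Change in Q when i joins cluster c, which now has size[c] members.
                return link[c] - penalty * (2 * size[c] + 1)

            candidates = set(link) | {old, next_label}
            best = max(sorted(candidates), key=gain)
            if gain(best) <= gain(old) + 1e-12:
                best = old  # move only if the quality strictly improves
            if best != old:
                moved = True
                if best == next_label:
                    next_label += 1
            cluster_of[i] = best
            size[best] += 1
        if not moved:
            break
    return cluster_of

# Toy network (hypothetical): two triangles joined by one citation relation.
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
clusters = local_moving(edges, gamma=1.0)
```

On this toy network with a resolution of 1, the heuristic separates the two triangles. A single local moving pass of this kind can get stuck in poor local optima on realistic networks, which is the kind of weakness that more elaborate algorithms are designed to address.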

Our clustering technique usually identiﬁes a relatively limited number of larger clusters

and a more substantial number of smaller clusters. Sometimes clusters are very small and

for instance include only one or two publications. Because in many cases small clusters are

of limited interest, a minimum cluster size parameter can be speciﬁed. Clusters that are too

small can be either discarded or merged with other clusters. We refer to Waltman and Van
Eck (2012) for a discussion of the approach that we take to merge small clusters with larger
ones.
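The two options for handling small clusters can be sketched as follows. The merging rule used here, which joins a small cluster to the cluster with which it shares the most direct citation relations, is an assumption made for illustration; the rule that we actually use is described in Waltman and Van Eck (2012). The function name `merge_small_clusters` and the toy data are hypothetical, and cluster labels are assumed to be mutually comparable (for example all strings).

```python
from collections import Counter

def merge_small_clusters(edges, cluster_of, min_size):
    """Merge or discard clusters with fewer than min_size publications.

    A small cluster is merged into the cluster it shares the most citation
    relations with (illustrative rule); if it has no relations with any other
    cluster, its publications are discarded.
    """
    cluster_of = dict(cluster_of)  # work on a copy
    while True:
        sizes = Counter(cluster_of.values())
        small = [c for c, s in sizes.items() if s < min_size]
        if not small:
            return cluster_of
        # Handle the smallest cluster first; ties are broken by label.
        c = min(small, key=lambda k: (sizes[k], k))
        links = Counter()
        for p, q in edges:
            cp, cq = cluster_of.get(p), cluster_of.get(q)
            if cp is None or cq is None or cp == cq:
                continue  # discarded publication or relation within a cluster
            if cp == c:
                links[cq] += 1
            elif cq == c:
                links[cp] += 1
        members = [v for v, k in cluster_of.items() if k == c]
        if links:
            target = max(sorted(links), key=links.__getitem__)
            for v in members:
                cluster_of[v] = target
        else:
            for v in members:
                del cluster_of[v]

# Toy data (hypothetical): cluster "b" has one publication citing cluster "a";
# cluster "d" has one publication without any citation relations.
edges = [(1, 2), (2, 3), (1, 3), (4, 3), (5, 6), (6, 7), (5, 7)]
initial = {1: "a", 2: "a", 3: "a", 4: "b", 5: "c", 6: "c", 7: "c", 8: "d"}
merged = merge_small_clusters(edges, initial, min_size=2)
```

Each pass of the loop removes one cluster, either by merging or by discarding it, so the procedure always terminates.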

Results

We now demonstrate how CitNetExplorer and VOSviewer can be used to cluster publi-

cations and to analyze the resulting clustering solutions. In our demonstration, we work

with a large data set of publications in the ﬁeld of astronomy and astrophysics. We

emphasize that in this paper it is not our aim to assess the quality of our clustering solutions

or to compare our clustering solutions with other alternative solutions. We do not have the

domain knowledge required to provide an in-depth interpretation of our clusters and to

assess their quality. For a comparison of our clustering solutions with other alternative
solutions, we refer to the comparison paper by Velden et al. (2017) in this special issue.

Data

We use the ‘Astro data set’ that is also used in other papers in this special issue. A general
introduction to the data set is provided in the introductory paper by Gläser et al. (2017) in
this special issue. The data set was extracted from the Web of Science bibliographic

database. It includes all publications of the document types ‘article’, ‘letter’, and ‘pro-

ceedings paper’ published between 2003 and 2010 in journals belonging to the Web of

Science subject category ‘Astronomy and Astrophysics’. The number of publications in the
