Cluster Analysis of Typhoon Tracks. Part I: General Properties

doi:10.1175/JCLI4188.1

SUZANA J. CAMARGO AND ANDREW W. ROBERTSON

International Research Institute for Climate and Society, The Earth Institute at Columbia University, Palisades, New York

SCOTT J. GAFFNEY AND PADHRAIC SMYTH

Department of Computer Science, University of California, Irvine, Irvine, California

MICHAEL GHIL*

Department of Atmospheric and Oceanic Sciences, and Institute for Geophysics and Planetary Physics, University of California,

Los Angeles, Los Angeles, California

(Manuscript received 6 January 2006, in final form 28 August 2006)

ABSTRACT

A new probabilistic clustering technique, based on a regression mixture model, is used to describe tropical

cyclone trajectories in the western North Pacific. Each component of the mixture model consists of a

quadratic regression curve of cyclone position against time. The best-track 1950–2002 dataset is described

by seven distinct clusters. These clusters are then analyzed in terms of genesis location, trajectory, landfall,

intensity, and seasonality.

Both genesis location and trajectory play important roles in defining the clusters. Several distinct types

of straight-moving, as well as recurving, trajectories are identified, thus enriching this main distinction found

in previous studies. Intensity and seasonality of cyclones, though not used by the clustering algorithm, are

both highly stratified from cluster to cluster. Three straight-moving trajectory types have very small within-

cluster spread, while the recurving types are more diffuse. Tropical cyclone landfalls over East and South-

east Asia are found to be strongly cluster dependent, both in terms of frequency and region of impact.

The relationships of each cluster type with the large-scale circulation, sea surface temperatures, and the

phase of the El Niño–Southern Oscillation are studied in a companion paper.

1. Introduction

Typhoons have a large socioeconomic impact in

many Asian countries. The risk of landfall of a typhoon

or tropical storm depends on its trajectory. These tra-

jectories, in turn, vary strongly with the season (Gray

1979; Harr and Elsberry 1991), as well as on interannual

(Chan 1985) and interdecadal time scales (Ho et al.

2004). However, current knowledge is largely qualita-

tive, and the probabilistic behavior of tropical cyclone

trajectories needs to be better understood in order to

isolate potentially predictable aspects of landfall. Well-

calibrated probabilistic seasonal predictions of landfall

risk could form an important tool in risk management.

Tropical cyclogenesis over the tropical northwest

(NW) Pacific takes place in a broad region west of the

date line, between about 8° and 25°N. South of 15°N,

most of these tropical cyclones (TCs) follow rather

straight west-northwestward tracks. About one-third of

them continue in this direction and make landfall in

southeast Asia and southern China. Most of the re-

mainder “recurve,” that is, slow down, turn northward,

and then accelerate eastward as they enter the midlati-

tude westerlies (e.g., Harr and Elsberry 1995). Another

fraction of TCs track northward over the ocean, posing

no threat to land.

The large-scale circulation of the atmosphere has a

* Additional affiliation: Département Terre-Atmosphère-

Océan, and Laboratoire de Météorologie Dynamique du CNRS/

IPSL, Ecole Normale Supérieure, Paris, France.

Corresponding author address: Dr. Suzana J. Camargo, Inter-

national Research Institute for Climate and Society, Monell 225,

61 Route 9W, Palisades, NY 10964-8000.

E-mail: suzana@iri.columbia.edu

15 JULY 2007 CAMARGO ET AL. 3635

predominant role in determining a TC’s motion through

the steering by the surrounding large-scale flow (e.g.,

Chan and Gray 1982; Franklin et al. 1996; Chan 2005).

The cyclone and the environment interact to modify the

surrounding flow (Wu and Emanuel 1995), and the vor-

tex is then advected (steered) by the modified flow.

One important dynamical factor is the beta drift, in-

volving the interaction of the cyclone, the planetary

vorticity gradient, and the environmental flow. This

leads TCs to move northwestward even in a resting

environment in the Northern Hemisphere (Adem 1956;

Holland 1983; Wu and Wang 2004). Other effects can

also be important: the interaction of tropical cyclones

with mountain ranges leads to significant variations in

tracks, as often occurs in Taiwan (Wu and Kuo 1999).

This two-part study explores the hypothesis that the

large observed spread of TC tracks over the tropical

NW Pacific can be described well by a small number of

clusters of tracks, or TC “regimes.” The observed TC

variability on seasonal and interannual time scales is

then interpreted in terms of changes in the frequency of

occurrence of these TC regimes. In this paper, we ex-

plore the basic attributes of the underlying clusters by

applying a new clustering technique to the best-track

dataset of the Joint Typhoon Warning Center (JTWC).

The technique employs a mixture of polynomial regres-

sion models (i.e., curves) to fit the geographical

“shape” of the trajectories (Gaffney and Smyth 1999,

2005; Gaffney 2004). Camargo et al. (2007, hereafter

Part II) examine relationships between the clusters we

describe in the present paper and the large-scale atmo-

spheric circulation, as well as the El Niño–Southern

Oscillation (ENSO).

In midlatitude meteorology, the concept of planetary

circulation regimes (Legras and Ghil 1985), sometimes

called weather regimes (Reinhold and Pierrehumbert

1982), has been introduced in attempting to connect the

observations of persistent and recurring midlatitude

flow patterns with large-scale atmospheric dynamics.

These midlatitude circulation regimes have intrinsic

time scales of several days to a week or more and exert

a control on local weather (e.g., Robertson and Ghil

1999). Longer time-scale variability of weather statistics

(TCs in our case) is a result of changes over time in the

frequency-of-occurrence of circulation regimes. This

paradigm of climate variability provides a counterpart

to wave-like decompositions of atmospheric variability,

allowing the connection to be made with oscillatory

phenomena (Ghil and Robertson 2002), such as the

Madden–Julian oscillation.

Circulation regimes have most often been defined in

terms of clustering, whether fuzzy (Mo and Ghil 1987)

or hierarchical (Cheng and Wallace 1993), in terms of

maxima in the probability density function (PDF) of

the large-scale, low-frequency flow (Molteni et al. 1990;

Kimoto and Ghil 1993a,b), as well as in terms of quasi

stationarity (Ghil and Childress 1987; Vautard 1990)

and, more recently, using a probabilistic Gaussian mix-

ture model (Smyth et al. 1999).

In the case of TC trajectories, the K-means method

(MacQueen 1967) has been used to study western

North Pacific (Elsner and Liu 2003) and North Atlantic

(Elsner 2003) TCs. In those studies, the grouping

was done according to the positions of maximum and

final hurricane intensity (i.e., the last position at which

the TC had hurricane intensity). In both basins, three

clusters were chosen to describe the trajectories. The

K-means approach has also been used to cluster

North Atlantic extratropical cyclone trajectories, where

6-hourly latitude–longitude positions over 3 days were

converted into 24-dimensional vectors suitable for clus-

tering (Blender et al. 1997).
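The fixed-length representation described above can be sketched as follows. This is a minimal illustration, not the procedure of Blender et al. (1997): the interleaved latitude, longitude ordering, the use of the first 12 positions, and the names `track_to_vector`, `long_track`, and `short_track` are all assumptions made here. It shows why a track with fewer than 12 six-hourly positions has no 24-dimensional vector without padding or interpolation.

```python
from typing import List, Optional, Tuple

Position = Tuple[float, float]  # (latitude, longitude) in degrees

POINTS_PER_DAY = 4                   # 6-hourly sampling
N_DAYS = 3
N_POINTS = POINTS_PER_DAY * N_DAYS   # 12 positions -> 24 numbers


def track_to_vector(track: List[Position]) -> Optional[List[float]]:
    """Flatten the first 12 six-hourly positions into a 24-element vector.

    Returns None when the track is shorter than 3 days, because a
    fixed-length vector cannot be built without padding or interpolation.
    """
    if len(track) < N_POINTS:
        return None
    vector: List[float] = []
    for lat, lon in track[:N_POINTS]:
        vector.extend((lat, lon))
    return vector


# Hypothetical tracks, made up for illustration only.
long_track = [(10.0 + 0.5 * i, 150.0 - 1.0 * i) for i in range(20)]
short_track = long_track[:7]

print(len(track_to_vector(long_track)))  # 24
print(track_to_vector(short_track))      # None
```

Any scheme of this kind has to drop, truncate, or pad tracks whose length differs from the chosen window.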

The K-means method is a straightforward and widely

used partitioning method that seeks to assign each track

to one of K groups such that the total variance among

the groups is minimized. However, K-means cannot ac-

commodate tracks of different lengths, and we show

this to be a serious shortcoming for TCs. In a different

approach, Harr and Elsberry (1995) used fuzzy cluster

analysis and empirical orthogonal functions to describe

the spatial patterns associated with different typhoon

characteristics.

The finite mixture model used in this paper to fit the

geographical shape of the trajectories allows the clus-

tering to be posed in a rigorous probabilistic framework

and accommodates tropical cyclone tracks of different

lengths. These characteristics provide advantages over

the K-means method used in previous studies. The

main novelty here is to use an objective method to

classify the typhoon tracks based not only on a few

points of the trajectory, but on trajectory shape and

location.

The clustering methodology is briefly described in

section 2 and applied to the JTWC best-track dataset in

section 3. The two main trajectory types identified by

the cluster analysis correspond to straight movers and

recurvers; additional clusters correspond to more de-

tailed differences among these two main types, based

on location and track type. We study several character-

istics of the TCs in each cluster, including first position,

mean track, landfall, intensity, and lifetime, and com-

pare them with previous works in section 4. Discussion

and conclusions follow in section 5. In Part II, we study

how the large-scale circulation and ENSO affect each

cluster.

3636 JOURNAL OF CLIMATE VOLUME 20

2. Data and methodology

a. Data and definitions

The TC data used in this paper were based on the

JTWC best-track dataset available at 6-hourly sampling

frequency over the time interval 1950–2002 (Joint Ty-

phoon Warning Center 2005). The tracks were studied

over the western North Pacific, defined such that the

latitude–longitude of the TCs are inside the “rectangle”

(0°–60°N and 100°E–180°) during at least part of their

lifetimes. The clustering technique and the resulting

analysis were applied to a total of 1393 cyclone tracks.

We included only TCs with tropical storm intensity or

higher: tropical storms (TSs), both category 1 and 2

typhoons (TYs) as defined by the Saffir–Simpson scale

(Saffir 1977; Simpson and Riehl 1981), and intense ty-

phoons (ITYs; categories 3–5). Tropical depressions are

not included in the analysis.
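The selection rules above can be sketched in a few lines. The rectangle bounds come from the text; the wind thresholds of 34, 64, and 96 kt (commonly quoted lower bounds for tropical storm, category 1, and category 3 intensity) are not stated in the text and are assumptions here, as are the record layout and all function names.

```python
from typing import List, Optional, Tuple

# One 6-hourly fix: (latitude in deg N, longitude in deg E, max sustained wind in kt).
Fix = Tuple[float, float, float]

LAT_RANGE = (0.0, 60.0)      # from the text
LON_RANGE = (100.0, 180.0)   # from the text

# Assumed lower bounds in knots; the text does not give numerical thresholds.
TS_KT = 34.0    # tropical storm
TY_KT = 64.0    # typhoon (categories 1-2)
ITY_KT = 96.0   # intense typhoon (categories 3-5)


def enters_domain(track: List[Fix]) -> bool:
    """True if at least one fix lies inside the latitude-longitude rectangle."""
    return any(
        LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]
        for lat, lon, _ in track
    )


def intensity_class(track: List[Fix]) -> Optional[str]:
    """Classify a track by its peak wind; None means tropical depression."""
    peak = max((wind for _, _, wind in track), default=0.0)
    if peak >= ITY_KT:
        return "ITY"
    if peak >= TY_KT:
        return "TY"
    if peak >= TS_KT:
        return "TS"
    return None


def keep_track(track: List[Fix]) -> bool:
    """Keep tracks that enter the domain and reach tropical storm intensity."""
    return enters_domain(track) and intensity_class(track) is not None


# Hypothetical tracks, made up for illustration only.
typhoon = [(12.0, 145.0, 30.0), (14.0, 140.0, 70.0), (18.0, 132.0, 100.0)]
depression = [(10.0, 150.0, 20.0), (11.0, 148.0, 30.0)]
outside = [(15.0, 250.0, 80.0)]

print(intensity_class(typhoon), keep_track(typhoon))
print(intensity_class(depression), keep_track(depression))
print(keep_track(outside))
```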

The observed data quality is thought to be consider-

ably poorer during presatellite years (pre-1970). We

assume that although some of the TCs may be missing

in the JTWC (2005) database for the presatellite data,

especially those that remain over the ocean, the tracks

for those that do appear in the dataset are reliable, even

if their intensity is not. We repeated the cluster analysis

for the time interval 1970–2002 and found that the types

of tracks obtained in each cluster are essentially the

same. This verification lends credence to the data in the

earlier part of the record and demonstrates the robust-

ness of our results.

b. Clustering methodology

We present here a brief summary of the clustering

methodology (details are given in the appendix). A

more complete discussion is given by Gaffney (2004),

with an application of the clustering method to extra-

tropical cyclones over the North Atlantic (Gaffney et

al. 2007; a Matlab toolbox with the clustering algo-

rithms described in this paper is available online at

http://www.datalab.uci.edu/resources/CCT).

Our curve clustering method is based on the finite

mixture model (e.g., Everitt and Hand 1981), which

represents a data distribution as a convex linear com-

bination of component density functions. A key feature

of the mixture model is its ability to model highly non-

Gaussian (and possibly multimodal) densities using a

small set of basic component densities. Finite mixture

models have been widely used for clustering data in a

variety of areas (e.g., McLachlan and Basford 1988),

including the large-scale atmospheric circulation

(Smyth et al. 1999; Hannachi and O’Neill 2001).

Regression mixture models extend the standard mix-

ture modeling framework by replacing the marginal

component densities with conditional density compo-

nents. The new conditional densities are functions of

the data (i.e., cyclone position) conditioned on an in-

dependent variable (i.e., time). In this paper, the com-

ponent densities model a cyclone’s longitudinal and

latitudinal positions versus time using quadratic poly-

nomial regression functions, as discussed in Gaffney

(2004). The latitude and longitude positions are treated

as conditionally independent given the model, and thus

the complete function for a cyclone track is the product

of these two. Other models, such as higher-order poly-

nomials and splines can also be used within the mixture

framework, but the simple quadratic model appears to

offer the best trade-off between ease of interpretation

and goodness-of-fit.
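One way to write the model just described is given below. The notation is introduced here for illustration and is not taken from the paper's appendix: track $i$ has $n_i$ positions with longitude $x_{ij}$ and latitude $y_{ij}$ at times $t_{ij}$, component $k$ has mixture weight $\alpha_k$ and quadratic coefficients $\beta$, and the noise about each regression curve is assumed Gaussian with component-specific variances and independent across time points.

```latex
\[
p(\mathbf{z}_i \mid \mathbf{t}_i, \Theta)
  = \sum_{k=1}^{K} \alpha_k \prod_{j=1}^{n_i}
    \mathcal{N}\!\left(x_{ij} \,\middle|\, \beta^{x}_{k0} + \beta^{x}_{k1} t_{ij} + \beta^{x}_{k2} t_{ij}^{2},\; \sigma^{2}_{x,k}\right)
    \mathcal{N}\!\left(y_{ij} \,\middle|\, \beta^{y}_{k0} + \beta^{y}_{k1} t_{ij} + \beta^{y}_{k2} t_{ij}^{2},\; \sigma^{2}_{y,k}\right),
\qquad \sum_{k=1}^{K} \alpha_k = 1 .
\]
```

The product of the two normal densities reflects the conditional independence of longitude and latitude, and the product over $j$ runs over however many positions track $i$ has, which is how tracks of different lengths enter the same likelihood.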

Each trajectory (i.e., each cyclone track) is assumed

to be generated by one of K different regression mod-

els, each having its own shape parameters. The cluster-

ing problem is to (i) learn the parameters of all K mod-

els given the TC tracks, and (ii) infer which of the K

models is most likely to have generated each TC

track. Each track can be assigned to the mixture com-

ponent (and thus the cluster) that was most likely to

have generated that track given the model. In other

words, the assigned cluster has the highest posterior

probability given the track. An expectation maximiza-

tion (EM) algorithm for learning these model param-

eters can be defined in a manner similar to that for

standard (unconditional) mixtures (DeSarbo and Cron

1988; Gaffney and Smyth 1999; McLachlan and Krish-

nan 1997; McLachlan and Peel 2000). The resulting EM

algorithm is straightforward to implement and use, and

its computational complexity is linear in the number of

observations.
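The assignment step described above can be sketched as follows. This is a minimal illustration of computing posterior membership probabilities for one track given already-fitted components; it is not the EM fit itself and not the authors' Matlab toolbox. Gaussian noise with a single variance per coordinate and per component is assumed, and the `Component` class, the parameter values, and the example track are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Track = List[Tuple[float, float, float]]  # (time, latitude, longitude)
Coef = Tuple[float, float, float]         # quadratic: c0 + c1*t + c2*t^2


@dataclass
class Component:
    weight: float    # mixture weight
    lat_coef: Coef
    lon_coef: Coef
    lat_var: float   # assumed Gaussian noise variance about the curve
    lon_var: float


def quad(coef: Coef, t: float) -> float:
    return coef[0] + coef[1] * t + coef[2] * t * t


def log_normal(x: float, mean: float, var: float) -> float:
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def track_log_likelihood(track: Track, comp: Component) -> float:
    """Sum over all points; latitude and longitude are treated as independent."""
    total = 0.0
    for t, lat, lon in track:
        total += log_normal(lat, quad(comp.lat_coef, t), comp.lat_var)
        total += log_normal(lon, quad(comp.lon_coef, t), comp.lon_var)
    return total


def posterior(track: Track, components: List[Component]) -> List[float]:
    """Posterior membership probabilities, normalized with a log-sum-exp shift."""
    logs = [math.log(c.weight) + track_log_likelihood(track, c) for c in components]
    shift = max(logs)
    unnorm = [math.exp(v - shift) for v in logs]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]


# Hypothetical components: index 0 is a straight mover, index 1 a recurver.
components = [
    Component(0.5, (10.0, 0.5, 0.0), (150.0, -2.0, 0.0), 4.0, 4.0),
    Component(0.5, (20.0, 1.0, 0.1), (140.0, -1.5, 0.3), 4.0, 4.0),
]
# Hypothetical track that follows the straight-mover curve exactly.
track = [(float(t), 10.0 + 0.5 * t, 150.0 - 2.0 * t) for t in range(8)]

probs = posterior(track, components)
assigned = max(range(len(probs)), key=lambda k: probs[k])
print(assigned)  # 0
```

Because the likelihood is a sum over whatever points a track contains, the same function handles tracks of any length.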

Certain preprocessing steps are typically performed

on the cyclone tracks prior to clustering. For example,

Blender et al. (1997) subtract the coordinates of the

initial points of each extratropical cyclone track so that

they all begin at the latitude–longitude position of 0°,

0°. In addition they also normalize the latitude and lon-

gitude measurements to have the same variance. In our

experiments below we did not use any such preprocess-

ing—clustering the tracks directly produced results that

were easier to interpret and more meaningful than the

clustering of preprocessed tracks.
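For comparison, the preprocessing that was not applied here can be sketched as below. This is one possible reading of the two steps attributed to Blender et al. (1997), not their code: the variance equalization is implemented by dividing each coordinate by its standard deviation pooled over all tracks, which is an assumption, and the function names and example tracks are hypothetical.

```python
import statistics
from typing import List, Tuple

Track = List[Tuple[float, float]]  # (latitude, longitude) in degrees


def shift_to_origin(track: Track) -> Track:
    """Subtract the first position so the track starts at (0, 0)."""
    lat0, lon0 = track[0]
    return [(lat - lat0, lon - lon0) for lat, lon in track]


def equalize_variance(tracks: List[Track]) -> List[Track]:
    """Rescale so latitude and longitude each have unit pooled variance.

    Assumes both coordinates vary across the dataset (nonzero spread).
    """
    lats = [lat for tr in tracks for lat, _ in tr]
    lons = [lon for tr in tracks for _, lon in tr]
    sd_lat = statistics.pstdev(lats)
    sd_lon = statistics.pstdev(lons)
    return [[(lat / sd_lat, lon / sd_lon) for lat, lon in tr] for tr in tracks]


# Hypothetical tracks, made up for illustration only.
raw = [
    [(10.0, 150.0), (11.0, 148.0), (12.0, 146.0)],
    [(20.0, 140.0), (22.0, 141.0), (25.0, 143.0)],
]
shifted = [shift_to_origin(tr) for tr in raw]
scaled = equalize_variance(shifted)

print(shifted[0][0], shifted[1][0])  # (0.0, 0.0) (0.0, 0.0)
```

After the shift, the two tracks no longer carry their genesis locations, which is the geographic information that the direct clustering retains.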

c. Number of clusters

To select the most appropriate number of clusters,

we looked at both the in-sample and out-of-sample log-

likelihood values. The log-likelihood is defined as the

log-probability of the observed data under the model,

which can be seen as a goodness-of-fit metric for proba-

bilistic models. Used as an objective measure, one se-

lects the number of clusters for which the log-likelihood

is largest across a candidate set of values. Our resulting

in-sample score curve is shown in Fig. 1 (the out-of-

sample curve is similar and is not shown). The observed

log-likelihood values increased with the

number of clusters, and thus did not by themselves identify an

optimal number of clusters. In addition, the

within-cluster spread is plotted in Fig. 2 and can be used

as an additional measure for goodness of fit. The curves

in Figs. 1 and 2 mirror each other, showing obvious

diminishing returns of improvement in fit beyond K =

6–8, suggesting a reasonable stopping point somewhere

in-between.
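The idea of diminishing returns can be made concrete with a small sketch. The paper judges the curves by eye; the threshold rule below is a substitute introduced here for illustration, and the function names, the `fraction` parameter, and the toy log-likelihood values are all assumptions unrelated to the values in Fig. 1. Following the caption of Fig. 1, the best of several runs is kept for each K.

```python
from typing import Dict, List


def marginal_gains(loglik_by_k: Dict[int, float]) -> Dict[int, float]:
    """Improvement in log-likelihood when moving from the previous K to K."""
    ks = sorted(loglik_by_k)
    return {k: loglik_by_k[k] - loglik_by_k[k_prev] for k_prev, k in zip(ks, ks[1:])}


def last_k_with_useful_gain(loglik_by_k: Dict[int, float], fraction: float = 0.1) -> int:
    """Largest K reached before the gain drops below `fraction` of the first gain.

    Assumes the first gain is positive; this rule is a heuristic stand-in
    for a visual inspection of the curve.
    """
    gains = marginal_gains(loglik_by_k)
    ks = sorted(gains)
    threshold = fraction * gains[ks[0]]
    chosen = min(loglik_by_k)
    for k in ks:
        if gains[k] < threshold:
            break
        chosen = k
    return chosen


# Toy log-likelihoods from two runs per K, made up for illustration only.
runs_by_k: Dict[int, List[float]] = {
    1: [-520.0, -500.0],
    2: [-300.0, -310.0],
    3: [-200.0, -205.0],
    4: [-190.0, -195.0],
    5: [-185.0, -186.0],
}
loglik_by_k = {k: max(runs) for k, runs in runs_by_k.items()}

print(marginal_gains(loglik_by_k))
print(last_k_with_useful_gain(loglik_by_k))  # 3
```

A rule of this kind only narrows the candidate range; the final choice in the text also rests on how interpretable the resulting track types are.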

To evaluate the values K = 6–8 as candidates for the

number of clusters, we also carried out a qualitative

analysis based on how much the track types differ from

one cluster to another as the number of clusters in-

creases. Preliminary results carried out with six clusters

(Camargo et al. 2004) are very similar to those pre-

sented here. The main difference is that one of the K =

6 track types splits in two when K is set to 7, with

slightly different characteristics. Most of the results pre-

sented here and in Camargo et al. (2007) are not sen-

sitive to the choice of K within the range 6–8. As described by

Camargo et al. (2007), the choice of K = 7 is found to

produce particularly interpretable results with respect

to ENSO and was thus taken to be our final choice.

Figure 3 illustrates how the choice for the number of

clusters from K = 2–9 affects the final regression

curves. To emphasize differences in shape, the mean

regression trajectories are plotted with their initial po-

sitions collocated at the origin. The two main types of

TC behavior found in previous studies (Harr and Els-

berry 1991, 1995) are evident in these plots, namely,

“straight movers” and “recurvers.” The differentiation

between the two types is achieved for K ≥ 3. For each

of these two broad types, additional clusters yield dif-

ferences in compass bearing for the straight movers and

differences in the recurving portion for the recurvers.

This remark is particularly valid for odd values of K

(Figs. 3b,d,f,h). Although some of the regression curves

look very similar in Fig. 3, their initial positions differ in

several cases and there are also differences in trajectory

length. Since the regression curves are plotted with the

same number of points, the distances between plotted

points are smaller or larger based on average speed

over such a period. It is interesting to note that along

the recurving trajectories, the points are very close to

each other within the recurving portion, showing that

TCs slow down before changing direction. The recurv-

ing usually occurs when the storms move from a region

of easterlies to a region of westerlies, with the wind

speed decreasing near the recurve point. It is important

to note, however, that the clustering technique has no

access to the wind fields.
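The link between point spacing and translation speed can be checked directly from 6-hourly positions. The sketch below uses the haversine great-circle distance on a spherical Earth of radius 6371 km; the formula choice, the function names, and the toy recurving track are assumptions made here and are not part of the paper's analysis.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical-Earth assumption


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2.0) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def translation_speeds_kmh(track: List[Tuple[float, float]],
                           hours_per_step: float = 6.0) -> List[float]:
    """Mean speed over each interval between consecutive (lat, lon) positions."""
    return [great_circle_km(*p, *q) / hours_per_step for p, q in zip(track, track[1:])]


# Toy recurving track, made up for illustration: westward, slow turn, then eastward.
toy_track = [(15.0, 140.0), (16.0, 138.0), (16.5, 137.5), (17.5, 138.5), (19.0, 141.0)]
speeds = translation_speeds_kmh(toy_track)
slowest_interval = min(range(len(speeds)), key=lambda i: speeds[i])

print([round(s, 1) for s in speeds])
print(slowest_interval)  # 1, the interval around the turn
```

Closely spaced points on a curve sampled at fixed time steps correspond to low values in this list.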

The regression trajectories for the six, seven, and

eight clusters are shown in Fig. 4; in this case, the initial

positions were retained. Note that solutions with odd (even)

values of K resemble one another more closely than solutions with adjacent values of K.

For the chosen number of clusters (K = 7), shown in

FIG. 1. Log-likelihood values for different numbers of TC track

clusters. The log-likelihood values shown are the maximum of 16

runs, obtained by a random permutation of the tropical cyclones

given to the cluster model.

FIG. 2. Within-cluster error for different numbers of TC track

clusters. The cluster error values shown are the minimum of 16

runs, obtained by a random permutation of the tropical cyclones

given to the cluster model.

Fig. 4c there are four clusters of straight movers and

three of recurvers. Notice the strong separation be-

tween the clusters in terms of their genesis location: five

clusters have genesis positions near 10°N in latitude,

but spread in longitude from near the Philippines to just

west of the date line. The other two clusters (both re-

curvers) start near 20°N.

Looking at the population of each cluster in Table 1,

we see that there are three dominant clusters (A, B, and

C), each accounting for approximately 20% of the

tracks. Clusters D and E occur less often (13%), while

clusters F and G (each containing about 100 cyclones)

are relatively rare (8%). When only considering the last

33 yr, 1970–2002, the number and characteristics of the

clusters did not change (see section 2a), but their rela-

tive sizes did change somewhat (not shown), with the

dominant clusters (such as A and C) decreasing and the

least populated ones (E, F, and G) increasing. This sig-

nificant change in relative cluster sizes could be due to

either a decadal shift in the occurrence of tracks (Ho et

al. 2004), or to data issues, with fewer TCs being de-

tected over open waters before the satellite era.

3. Tropical cyclone clusters

a. Trajectories

The TC tracks in clusters A–G from the time interval

1983–2002 are shown in Fig. 5, along with the mean

regression curves for each cluster. For comparison, the

tracks of all TCs in the same time interval are also

shown (Fig. 5h). The figure illustrates the high degree

of geographic localization achieved by the cluster

analysis, mainly due to the fact that the tracks were not

reduced to a common origin before performing the

clustering. The spread about the mean track for the

straight-moving clusters B, D, and F is particularly

small. Although the mean regression trajectories of

FIG. 3. Mean regression trajectories of the western North Pacific TCs with (a) two, (b) three,

(c) four, (d) five, (e) six, (f) seven, (g) eight, and (h) nine clusters. The mean trajectories start

at 0° lat and lon, for plotting purposes only.
