SARIMA Forecasts of Dengue Incidence in Brazil, Mexico,
Singapore, Sri Lanka, and Thailand: Model Performance and
the Significance of Reporting Delays
Pete Riley (1,5), Michal Ben-Nun (1,5), James Turtle (1), David Bacon (3), and Steven Riley (4)
1 Predictive Science Inc., San Diego, CA, U.S.A.
2 Dengue Branch, Division of Vector-Borne Diseases, Centers for Disease Control and
Prevention, San Juan, Puerto Rico.
3 Leidos, Arlington, VA, U.S.A.
4 Imperial College London, London, England, U.K.
5 These authors contributed equally to this work.
* pete@predsci.com
Abstract
Timely and accurate knowledge of Dengue incidence is of value to public health
professionals because it helps to enable the precise communication of risk, improved
allocation of resources to potential interventions, and improved planning for the
provision of clinical care of severe cases. Therefore, many national public health
organizations make local Dengue incidence data publicly available for individuals and
organizations to use to manage current risk. The availability of these data has also
resulted in active research into the forecasting of Dengue incidence as a way to increase
the public health value of incidence data. Here, we robustly assess time-series-based
forecasting approaches against a null model (historical average incidence) for the
forecasting of incidence up to four months ahead. We used publicly available data from
multiple countries: Brazil, Mexico, Singapore, Sri Lanka, and Thailand; and found that
our time series methods are more accurate than the null model across all populations,
especially for 1- and 2-month-ahead forecasts. We tested whether the inclusion of
climatic data improved forecast accuracy and found only modest, if any, improvements.
We also tested whether national time-series forecasts are more accurate if made from
aggregate sub-national forecasts, and found mixed results. We used our forecasting
results to illustrate the high value of increased reporting speed. This framework and
test data are available as an R package. The non-mechanistic approaches described here
motivate further research into the use of disease-dynamic models to increase the
accuracy of medium-term Dengue forecasting across multiple populations.
NOTE: This preprint reports new research that has not been certified by peer review and
should not be used to guide clinical practice.
medRxiv preprint doi: https://doi.org/10.1101/2020.06.26.20141093; this version posted
June 29, 2020. The copyright holder for this preprint (which was not certified by peer
review) is the author/funder, who has granted medRxiv a license to display the preprint
in perpetuity. All rights reserved. No reuse allowed without permission.
Author summary
Dengue is a mosquito-borne disease caused by the Dengue virus. Since the Second
World War it has evolved into a global problem, securing a foothold in more than 100
countries. Each year, hundreds of millions of people become infected, and upwards of
10,000 die from the disease. Accurate forecasts of the number of cases likely to emerge
in particular locations are therefore vital for public health professionals who need to
develop appropriate plans. In this study, we have refined a technique that allows us
to forecast the number of cases of Dengue in a particular location, up to four months in
advance. We test the approach using state-level and national-level data from Brazil,
Mexico, Singapore, Sri Lanka, and Thailand. We found that the model can generally
make useful forecasts, particularly on a two-month horizon. Including information about
climatic conditions yielded only modest improvements to the forecast. Our results
highlight the need for both timely and accurate reports. We also anticipate that this
approach may be more generally useful to the scientific community; thus, we are
releasing a framework that will allow interested parties to replicate our work and to
apply it to other sources of Dengue data, as well as to other infectious diseases in
general.
Introduction
Forecasting the near- and long-term evolution of Dengue incidence within a country has
obvious value for policy makers. Dengue is a mosquito-borne disease caused by the
Dengue virus, affecting most tropical regions of the world [1]. Each year, between 50
and 500 million people are infected with Dengue. Of these, between 10,000 and 20,000
people die [2, 3]. Although the disease is endemic, seasons vary dramatically from
one to the next, sometimes by more than an order of magnitude [4]. Knowledge of the
estimated total number of cases, the timing of the peak, and near-term incidence can
allow public health personnel to allocate limited resources appropriately, particularly
when Dengue may be competing with other diseases. Accurate predictions of large
increases in incidence would allow health care managers to prepare for a surge of
patients and to pursue more proactive interventions, such as vector control.
A number of statistical and mechanistic models have been developed with the aim of
modeling or forecasting Dengue in various settings [5–13]. While it is likely that
mechanistic approaches [10, 14] should ultimately outperform statistical [15, 16] and
machine learning (ML) approaches [17], our current understanding of the complex dynamics
associated with vector-borne diseases, together with the limitations in available data,
suggests that statistical techniques should be considered first. Of the statistical
approaches, Seasonal Autoregressive Integrated Moving Average (SARIMA) models
have received the most attention [18–20]. Recently, these models have been applied in a
pseudo-forecasting mode to assess their performance. In one study, the best overall
performing model relied on lagged observations of one month, with three yearly lag
terms and the first yearly difference [6], suggesting that there was a long-term trend in
the data and that the average of the last three years' observations for a given month was
a good model if adjusted by a single most recent observation from the current year.
Notably, climate data did not appreciably improve the predictive power of the SARIMA
models.
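As an illustrative sketch, the structure described above (one monthly lag, three yearly
lag terms, and the first yearly difference) can be written in standard multiplicative
SARIMA notation as a $(1, 0, 0)(3, 1, 0)_{12}$ model. The symbols are introduced here for
illustration and are not taken from the source: $B$ is the backshift operator
($B y_t = y_{t-1}$), $y_t$ is monthly incidence, $\phi_1$ and $\Phi_1, \Phi_2, \Phi_3$
are the non-seasonal and seasonal autoregressive coefficients, and $\varepsilon_t$ is
white noise. Any constant or mean term is omitted.

$$
(1 - \phi_1 B)\,(1 - \Phi_1 B^{12} - \Phi_2 B^{24} - \Phi_3 B^{36})\,(1 - B^{12})\, y_t = \varepsilon_t
$$

Dropping the factor $(1 - B^{12})$ gives the undifferenced $(1, 0, 0)(3, 0, 0)_{12}$
form under the same assumptions.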
In this study, we describe a statistical technique for predicting Dengue incidence
rates from one to four months in the future using a family of SARIMA models. We
apply the model to districts/provinces/states within five distinct countries (Brazil,
Mexico, Singapore, Sri Lanka, and Thailand) for which reliable monthly or weekly data
are available.
Results
We first examined the data for evidence of seasonality. We then applied a reasonably
exhaustive set of SARIMA models to these data, first without, and then with the
addition of co-variate data. Next we explored the effects of direct versus aggregate
forecasting, and finally, we investigated the effects of reporting time delays.
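The SARIMA forecasts in this study are assessed against a null model based on historical
average incidence, using the ratio of mean absolute errors. The following is a minimal
sketch of that comparison, not the authors' implementation: the function names, the
synthetic counts, and the stand-in "model" forecast are all hypothetical, and the series
is assumed to start in January.

```python
import numpy as np

def monthly_null_forecast(history, target_months):
    """Null model: mean of past observations for the same calendar month.

    history: 1-D monthly counts covering at least 12 months; history[0] is
             assumed (for this sketch) to be a January value.
    target_months: calendar month indices to forecast (0 = Jan, ..., 11 = Dec).
    """
    history = np.asarray(history, dtype=float)
    month_of = np.arange(history.size) % 12
    monthly_mean = np.array([history[month_of == m].mean() for m in range(12)])
    return monthly_mean[np.asarray(target_months)]

def mae(observed, forecast):
    """Mean absolute error between observed and forecast values."""
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(observed - forecast)))

def skill_ratio(observed, model_forecast, null_forecast):
    """MAE(model) / MAE(null); values below 1 mean the model beats the null."""
    return mae(observed, model_forecast) / mae(observed, null_forecast)

# Synthetic example: three years of history built from one seasonal pattern.
pattern = np.array([10, 12, 20, 40, 80, 120, 150, 130, 90, 50, 25, 12], dtype=float)
history = np.concatenate([0.8 * pattern, 1.0 * pattern, 1.2 * pattern])

observed = 1.5 * pattern[:4]                  # January to April of an unusually high year
null_fc = monthly_null_forecast(history, np.arange(4))
model_fc = 0.9 * observed                     # hypothetical stand-in for a SARIMA forecast

ss = skill_ratio(observed, model_fc, null_fc)
print(f"MAE ratio (model / null): {ss:.2f}")
```

With this definition, a ratio below 1 indicates that the candidate model has a smaller
mean absolute error than the historical-average null model over the scored months.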
Periodicity/Seasonality in the incidence data
There is a clear seasonal component to the incidence profiles for Brazil (data publicly
available for the interval 2001-2012), Thailand (2007-2018), Mexico (1985-2017), and
Singapore (2005-2019) (Fig. 1). For Sri Lanka (2010-2019), the picture is more complex,
at least in part because the peak values in 2017 were more than six times higher than
the average peak values over the previous decade. To explore whether possible seasonal
signatures exist, we applied a wavelet transform to district-level data of Sri Lanka
(Fig. 2). With the exception of Ratnapura, there is no evidence for a stable, annual
peak (with a period of 12 months). On the other hand, there is some evidence for a
sustained signal at 28-32 months that is present in all of the top-five districts (and
most of the other 19 districts). In the right column of Fig. 2 we explore the idea
that outbreaks spread out from the capital, Colombo (blue line in each panel), to other
districts (red line in each panel) by plotting the average phase of the signal with a
period of 10-14 months; however, we find no consistent evidence for a lead/lag.
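To illustrate how a wavelet transform can reveal a dominant period in a monthly series,
the sketch below computes wavelet power on synthetic data with a built-in annual cycle.
The source does not state which wavelet or software was used, so the Morlet wavelet
(with the common choice omega0 = 6), the function name, and the synthetic series are all
assumptions made for illustration.

```python
import numpy as np

def morlet_power(series, periods, omega0=6.0):
    """Wavelet power |W|^2 of a series at each requested Fourier period.

    Returns an array of shape (len(periods), len(series)).
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                          # remove the mean before transforming
    t = np.arange(x.size)
    lag = t[None, :] - t[:, None]             # lag[i, j] = t_j - t_i
    # Standard conversion from Fourier period to scale for the Morlet wavelet.
    scale_per_period = (omega0 + np.sqrt(2.0 + omega0**2)) / (4.0 * np.pi)
    power = np.empty((len(periods), x.size))
    for k, period in enumerate(periods):
        s = period * scale_per_period
        eta = lag / s
        # Complex conjugate of the Morlet wavelet centred on each time point.
        psi_conj = np.pi**-0.25 * np.exp(-1j * omega0 * eta - 0.5 * eta**2)
        coeffs = psi_conj @ x / np.sqrt(s)
        power[k] = np.abs(coeffs) ** 2
    return power

# Synthetic monthly series (20 years) with a pure 12-month cycle.
months = np.arange(240)
synthetic = 100.0 + 50.0 * np.sin(2.0 * np.pi * months / 12.0)

periods = np.arange(6, 37)                    # candidate periods, in months
power = morlet_power(synthetic, periods)
global_power = power.mean(axis=1)             # time-averaged power per period
dominant_period = int(periods[np.argmax(global_power)])
print("Dominant period (months):", dominant_period)
```

For real incidence data, the full time-by-period power array is usually more informative
than the time average, because it shows whether a periodic signal is sustained or only
intermittent.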
SARIMA Analysis
Application of the leading eight SARIMA models to each of the
states/provinces/regions within Brazil, Thailand, Mexico, and Singapore generally
demonstrated that the $(1, 0, 0)(3, 0, 0)_{12}$ SARIMA model performed best. For example,
a comparison of eight SARIMA models across 76 Thai provinces with our null historical
model showed that a simple monthly historical average with a 1-month, 3-year lagged
regression model of either the direct observations or their first difference (i.e.,
$(1, 0, 0)(3, 0, 0)_{12}$ or $(1, 0, 0)(3, 1, 0)_{12}$, respectively) performed best
across all provinces (Fig. 3). Unsurprisingly, the Mean Absolute Error (MAE) tended to
decrease moving from the most populous to the least populous provinces, while the Mean
Relative Absolute Error (MRAE) remained approximately constant from one province to
another. Using the ratio of MAE(SARIMA) to MAE(NULL) as a measure of Skill Score (SS)
for each SARIMA model, we infer that almost all eight of the SARIMA models outperformed
Fig 1. Heat maps for provinces or regions within (a) Brazil (2001-2012), (b) Mexico
(1985-2017), (c) Sri Lanka (2010-2019), and (d) Thailand (2007-2018). Monthly cadence is
shown in all four panels. (e) Weekly national level incidence data for Singapore
(2005-2019). In each panel, values represent $\log_{10}(I + 1)$, where $I$ is the number
of monthly or weekly cases in that region (or country).