SARIMA Forecasts of Dengue Incidence in Brazil, Mexico, Singapore, Sri Lanka, and Thailand: Model Performance and the Significance of Reporting Delays

doi:10.1101/2020.06.26.20141093


Pete Riley(1,5), Michal Ben-Nun(1,5), James Turtle(1), David Bacon(3), and Steven Riley(4)

1 Predictive Science Inc., San Diego, CA, U.S.A.

2 Dengue Branch, Division of Vector-Borne Diseases, Centers for Disease Control and Prevention, San Juan, Puerto Rico.

3 Leidos, Arlington, VA, U.S.A.

4 Imperial College London, London, England, U.K.

5 These authors contributed equally to this work.

* pete@predsci.com

Abstract

Timely and accurate knowledge of Dengue incidence is of value to public health professionals because it helps to enable the precise communication of risk, improved allocation of resources to potential interventions, and improved planning for the provision of clinical care of severe cases. Therefore, many national public health organizations make local Dengue incidence data publicly available for individuals and organizations to use to manage current risk. The availability of these data has also resulted in active research into the forecasting of Dengue incidence as a way to increase the public health value of incidence data. Here, we robustly assess time-series-based forecasting approaches against a null model (historical average incidence) for the forecasting of incidence up to four months ahead. We used publicly available data from five countries (Brazil, Mexico, Singapore, Sri Lanka, and Thailand) and found that our time-series methods are more accurate than the null model across all populations, especially for 1- and 2-month-ahead forecasts. We tested whether the inclusion of climatic data improved forecast accuracy and found only modest, if any, improvements. We also tested whether national time-series forecasts are more accurate if made from aggregate sub-national forecasts, and found mixed results. We used our forecasting results to illustrate the high value of increased reporting speed. This framework and test data are available as an R package. The non-mechanistic approaches described here motivate further research into the use of disease-dynamic models to increase the accuracy of medium-term Dengue forecasting across multiple populations.

medRxiv preprint doi: https://doi.org/10.1101/2020.06.26.20141093; this version posted June 29, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Author summary

Dengue is a mosquito-borne disease caused by the Dengue virus. Since the Second World War it has evolved into a global problem, securing a foothold in more than 100 countries. Each year, hundreds of millions of people become infected, and upwards of 10,000 die from the disease. Thus, being able to accurately forecast the number of cases likely to emerge in particular locations is vital for public health professionals to be able to develop appropriate plans. In this study, we have refined a technique that allows us to forecast the number of cases of Dengue in a particular location, up to four months in advance. We test the approach using state-level and national-level data from Brazil, Mexico, Singapore, Sri Lanka, and Thailand. We found that the model can generally make useful forecasts, particularly on a two-month horizon. We tested whether information about climatic conditions improved the forecast, and found only modest improvements. Our results highlight the need for both timely and accurate reports. We also anticipate that this approach may be more generally useful to the scientific community; thus, we are releasing a framework that will allow interested parties to replicate our work and to apply it to other sources of Dengue data, as well as to other infectious diseases in general.

Introduction

Forecasting the near- and long-term evolution of Dengue incidence within a country has obvious value for policy makers. Dengue is a mosquito-borne disease caused by the Dengue virus, affecting most tropical regions of the world [1]. Each year, between 50 and 500 million people are infected with Dengue. Of these, between 10,000 and 20,000 people die [2, 3]. In spite of the disease being endemic, seasons vary dramatically from one to the next, sometimes by more than an order of magnitude [4]. Knowledge of the estimated total number of cases, the timing of the peak, and near-term incidence can allow public health personnel to allocate limited resources appropriately, particularly when Dengue may be competing with other diseases. Accurate predictions of large increases in incidence would allow health care managers to prepare for a surge of patients and to plan more proactive interventions, such as vector control.

A number of statistical and mechanistic models have been developed with the aim of modeling or forecasting Dengue in various settings [5–13]. While it is likely that mechanistic approaches [10, 14] should ultimately outperform statistical [15, 16] and machine learning (ML) approaches [17], our current understanding of the complex dynamics associated with vector-borne diseases, as well as the limitations in available data, suggest that statistical techniques should be considered first. Of the statistical approaches, Seasonal Autoregressive Integrated Moving Average (SARIMA) models have received the most attention [18–20]. Recently, these models have been applied in a pseudo-forecasting mode to assess their performance. In one study, the best overall performing model relied on lagged observations of one month, with three yearly lag terms and the first yearly difference [6], suggesting that there was a long-term trend in the data and that the average of the last three years' observations for a given month was a good model if adjusted by the single most recent observation from the current year. Notably, climate data did not appreciably improve the power of the SARIMA models.
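For readers less familiar with the notation, the model structure described for that study (one monthly lag, three yearly lag terms, and a first yearly difference) can be written out under the standard $(p, d, q)(P, D, Q)_s$ convention as $(1, 0, 0)(3, 1, 0)_{12}$. The symbols below ($B$ for the backshift operator, $\phi_1$ and $\Phi_k$ for the non-seasonal and seasonal autoregressive coefficients, $y_t$ for incidence in month $t$, and $\varepsilon_t$ for white noise) are our notational assumptions and are not taken from the cited study; any intercept term is omitted for brevity.

```latex
% SARIMA (1,0,0)(3,1,0)_{12}: one monthly AR term, three yearly AR terms,
% and one seasonal (12-month) difference. B is the backshift operator,
% B^k y_t = y_{t-k}.
\[
  (1 - \phi_1 B)\,
  \bigl(1 - \Phi_1 B^{12} - \Phi_2 B^{24} - \Phi_3 B^{36}\bigr)\,
  (1 - B^{12})\, y_t = \varepsilon_t
\]
```

Dropping the factor $(1 - B^{12})$ gives the undifferenced $(1, 0, 0)(3, 0, 0)_{12}$ form.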

In this study, we describe a statistical technique for predicting Dengue incidence rates from one to four months in the future using a family of SARIMA models. We apply the model to districts/provinces/states within five distinct countries (Brazil, Mexico, Singapore, Sri Lanka, and Thailand) for which reliable monthly or weekly data are available.
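As a rough illustration of how one pure-autoregressive member of such a family, the $(1, 0, 0)(3, 0, 0)_{12}$ form, could produce one- to four-month-ahead forecasts once its coefficients are known, consider the minimal sketch below. It is not the released R package: the function names are hypothetical, the coefficients are placeholders rather than fitted values, and no intercept or fitting step is included.

```python
def expand_ar_lags(phi, seasonal_phis, season=12):
    """Expand (1 - phi*B)(1 - sum_k Phi_k * B^(k*season)) into {lag: coefficient}.

    The returned coefficients c satisfy
        y_t = sum over lags of c[lag] * y_{t-lag} + noise.
    """
    coeffs = {1: phi}
    for k, big_phi in enumerate(seasonal_phis, start=1):
        lag = k * season
        coeffs[lag] = coeffs.get(lag, 0.0) + big_phi
        # Cross term from multiplying the two AR polynomials.
        coeffs[lag + 1] = coeffs.get(lag + 1, 0.0) - phi * big_phi
    return coeffs


def forecast_ahead(history, phi, seasonal_phis, horizon=4, season=12):
    """Iterate the AR recursion forward, feeding forecasts back in as inputs."""
    coeffs = expand_ar_lags(phi, seasonal_phis, season)
    max_lag = max(coeffs)
    if len(history) < max_lag:
        raise ValueError(f"need at least {max_lag} past observations")
    values = list(history)
    for _ in range(horizon):
        t = len(values)
        values.append(sum(c * values[t - lag] for lag, c in coeffs.items()))
    return values[len(history):]


# Synthetic monthly series (made-up numbers): a within-year ramp plus a
# year-on-year step, four years long.
history = [(t % 12) + 10 * (t // 12) for t in range(48)]

# Special case: with phi = 0 and equal yearly weights of 1/3, each forecast is
# the average of the same calendar month over the previous three years.
preds = forecast_ahead(history, phi=0.0, seasonal_phis=[1 / 3, 1 / 3, 1 / 3])
print([round(p, 6) for p in preds])
```

In this special case the forecast reduces to a three-year same-month average; a non-zero `phi` then adjusts that average using the most recent observation.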

Results

We first examined the data for evidence of seasonality. We then applied a reasonably exhaustive set of SARIMA models to these data, first without, and then with, the addition of covariate data. Next, we explored the effects of direct versus aggregate forecasting, and finally, we investigated the effects of reporting time delays.

Periodicity/Seasonality in the incidence data

There is a clear seasonal component to the incidence profiles for Brazil (data publicly available for the interval 2001-2012), Thailand (2007-2018), Mexico (1985-2017), and Singapore (2005-2019) (Fig. 1). For Sri Lanka (2010-2019), the picture is more complex. This is due, at least in part, to the fact that the peak values in 2017 were more than six times higher than the average peak values over the previous decade. To explore whether possible seasonal signatures exist, we applied a wavelet transform to district-level data of Sri Lanka (Fig. 2). With the exception of Ratnapura, there is no evidence for a stable, annual peak (with a period of 12 months). On the other hand, there is some evidence for a sustained signal at 28-32 months that is present in all of the top-five districts (and most of the other 19 districts). In the right column of Fig. 2 we explore the idea that outbreaks spread out from the capital, Colombo (blue line in each panel), to other districts (red line in each panel) by plotting the average phase of the amplitude with a period of 10-14 months; however, we find no consistent evidence for a lead/lag.
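To make the idea of looking for power at a given period more concrete, the following is a minimal sketch of a continuous wavelet transform evaluated at a single period. It assumes a Morlet mother wavelet with central frequency 6, which is a common convention but is not stated in the text, and it uses a synthetic sinusoid rather than the Sri Lankan district data. All function names are hypothetical, and edge effects are not treated.

```python
import cmath
import math

OMEGA0 = 6.0  # Morlet central frequency (a conventional choice, assumed here)


def morlet(eta):
    """Morlet mother wavelet at dimensionless time eta."""
    return (math.pi ** -0.25) * cmath.exp(1j * OMEGA0 * eta) * math.exp(-0.5 * eta * eta)


def period_to_scale(period):
    """Convert a Fourier period (in months) to the matching Morlet scale."""
    return period * (OMEGA0 + math.sqrt(2.0 + OMEGA0 ** 2)) / (4.0 * math.pi)


def wavelet_power(series, period):
    """Wavelet power |W|^2 at every time step, for one period."""
    mean = sum(series) / len(series)
    x = [v - mean for v in series]  # remove the mean before transforming
    scale = period_to_scale(period)
    norm = 1.0 / math.sqrt(scale)   # unit time step of one month assumed
    power = []
    for n in range(len(x)):
        w = sum(x[m] * morlet((m - n) / scale).conjugate() for m in range(len(x)))
        power.append(abs(norm * w) ** 2)
    return power


def mean_power(series, period):
    """Time-averaged wavelet power at one period."""
    p = wavelet_power(series, period)
    return sum(p) / len(p)


# Synthetic ten-year monthly series with a purely annual cycle (made-up values).
annual = [100 + 50 * math.sin(2 * math.pi * t / 12) for t in range(120)]

p12 = mean_power(annual, 12)  # power at the annual period
p30 = mean_power(annual, 30)  # power near the 28-32 month band
print(p12 > p30)
```

For a series with a stable annual cycle, the time-averaged power at 12 months dominates the power at 30 months; a district whose power is instead concentrated near 28-32 months would show the opposite ordering.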

SARIMA Analysis

Application of the leading eight SARIMA models to each of the states/provinces/regions within Brazil, Thailand, Mexico, and Singapore generally demonstrated that the $(1, 0, 0)(3, 0, 0)_{12}$ SARIMA model performed best. For example, comparison of eight SARIMA models across 76 Thai provinces with our null historical model showed that a simple monthly historical average with a 1-month-3-year lagged regression model of either the direct observations or their first difference (i.e., $(1, 0, 0)(3, 0, 0)_{12}$ or $(1, 0, 0)(3, 1, 0)_{12}$, respectively) performed best across all provinces (Fig. 3). Unsurprisingly, the Mean Absolute Error (MAE) tended to decrease moving from the most populous to least populous provinces, while the Mean Relative Absolute Error (MRAE) remained approximately constant from one province to another. Using the ratio of MAE(SARIMA) to MAE(NULL) as a measure of Skill Score (SS) for each SARIMA model, we infer that almost all eight of the SARIMA models outperformed


Fig 1. Heat maps for provinces or regions within (a) Brazil (2001-2012), (b) Mexico (1985-2017), (c) Sri Lanka (2010-2019), and (d) Thailand (2007-2018). Monthly cadence is shown in all four panels. (e) Weekly national level incidence data for Singapore (2005-2019). In each panel, values represent $\log_{10}(I + 1)$, where $I$ is the number of monthly or weekly cases in that region (or country).
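The comparison described in the SARIMA Analysis section, in which a model's MAE is divided by the MAE of the null (monthly historical average) model, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the numbers are synthetic, and the "model" forecasts are placeholders rather than SARIMA output.

```python
from collections import defaultdict


def monthly_historical_average(train):
    """Null model: mean cases per calendar month.

    train is a list of (month, cases) pairs with month in 1..12.
    Returns {month: mean cases over the training years}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, cases in train:
        totals[month] += cases
        counts[month] += 1
    return {month: totals[month] / counts[month] for month in totals}


def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over paired values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mae_ratio(observed, model_forecast, null_forecast):
    """MAE(model) / MAE(null); values below 1 mean the model beat the null."""
    return (mean_absolute_error(observed, model_forecast)
            / mean_absolute_error(observed, null_forecast))


# Synthetic training data: two earlier years of January and February counts.
train = [(1, 10), (2, 30), (1, 20), (2, 50)]
null_by_month = monthly_historical_average(train)  # {1: 15.0, 2: 40.0}

# Synthetic evaluation months and placeholder model forecasts.
test_months = [1, 2]
observed = [18, 44]
null_forecast = [null_by_month[m] for m in test_months]
model_forecast = [17, 42]  # stand-in values, not real SARIMA output

ratio = mae_ratio(observed, model_forecast, null_forecast)
print(round(ratio, 3))
```

Because the ratio places the null model's error in the denominator, it gives a scale-free comparison across provinces with very different case counts, which MAE alone does not.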
