What was the process used to perform the phylodynamic analyses?

Phylodynamic analyses were performed on subsampled sets of the data described above (Data and data pre-processing) using a birth-death-sampling process as implemented in the BDSKY [55] model in BEAST2 [5].

What was the sampling proportion for sD.2?

The sampling proportion $s(t) = \psi(t)/(\psi(t) + \mu(t))$ was a priori assumed to arise from a uniform distribution with a lower limit of zero and an upper limit determined by the ratio of analyzed sequences to diagnosed cases, $s \sim U(0, q_i/d_i)$, where $d_i$ is the number of diagnoses and $q_i$ the number of sequences included in the analysis in interval $i$.
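As a small illustration of this prior, the sketch below computes the upper limit $q_i/d_i$ per interval and draws from the corresponding uniform distribution. The counts are hypothetical, and capping the bound at 1 is an assumption added here for safety; it is not stated in the text.

```python
import random

def sampling_upper_bound(n_sequences, n_diagnoses):
    """Upper limit q_i / d_i of the uniform prior on the sampling proportion.

    Capping at 1.0 is an assumption of this sketch (a proportion cannot exceed 1).
    """
    if n_diagnoses <= 0:
        raise ValueError("number of diagnoses must be positive")
    return min(1.0, n_sequences / n_diagnoses)

def draw_sampling_proportion(n_sequences, n_diagnoses, rng):
    """Draw s ~ U(0, q_i / d_i) for one interval."""
    return rng.uniform(0.0, sampling_upper_bound(n_sequences, n_diagnoses))

# Hypothetical (sequences, diagnoses) counts for two intervals.
intervals = [(250, 5000), (400, 2000)]
rng = random.Random(1)
bounds = [sampling_upper_bound(q, d) for q, d in intervals]
draws = [draw_sampling_proportion(q, d, rng) for q, d in intervals]
```

In a BEAST2 analysis the prior would be specified in the model configuration rather than sampled by hand; the snippet only makes the bound explicit.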

How did the authors estimate the effective reproduction numbers of SARS-CoV-2?

In parallel, the authors estimated corresponding effective reproduction numbers $R_e^{\varphi}(t)$ by applying the Wallinga-Teunis method [61] to incidence correlates $\varphi$ derived by GInPipe.

What is the probability of a decline in case detection in Switzerland?

Interestingly, their method predicts a decline in case detection in Switzerland after the broad introduction of antigen self-testing in November 2020.

What was the effect of the expansion of testing capacities?

Testing capacities were further expanded, especially in the health sector (including hospital patients and health and social care staff), with fairly stable case detection rates.

What is the proposed method for estimating the evolution of a viral outbreak?

A fully automated workflow has been generated using Snakemake [26] and is available from https://github.com/KleistLab/GInPipe. To test the proposed incidence reconstruction method, the authors stochastically simulated the evolutionary dynamics of a viral outbreak using a Poisson process formalism.

How did the case detection rate in Denmark increase from mid-May to mid-September?

Compared to the fairly stable case detection levels from mid-March to mid-May, this policy change leads to a 2-3 fold drop in case detection in the summer months from July to September.

(Open Access) Rapid incidence estimation from SARS-CoV-2 genomes reveals decreased case detection in Europe during summer 2020 (2021) | Maureen Rebecca Smith

Q: How many hours of computation were required to reconstruct the full incidence histories for Denmark, Scotland, Switzerland?

The authors used GInPipe to reconstruct complete incidence histories for Denmark, Scotland, Switzerland, and Victoria (Australia) from publicly available full length SARS-CoV-2 sequencing data provided through GISAID [14, 54] (Supplementary Note 4).

Rapid incidence estimation from SARS-CoV-2 genomes reveals decreased case detection in Europe during summer 2020

Maureen Smith

Robert Koch Institute

Maria Tromova

Robert Koch Institute

Ariane Weber

Max-Planck Institute

Yannick Duport

Robert Koch Institute

Denise Kühnert

Department of Archaeogenetics, Max Planck Institute for the Science of Human History, 07745 Jena, Germany

https://orcid.org/0000-0002-5657-018X

Max von Kleist (  kleistm@rki.de )

MF1 Bioinformatics, Robert Koch-Institute https://orcid.org/0000-0001-6587-6394

Article

Keywords: SARS-CoV-2, epidemiology, genomes

Posted Date: May 27th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-558667/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Version of Record: A version of this preprint was published at Nature Communications on October 14th, 2021. See the published version at https://doi.org/10.1038/s41467-021-26267-y.

Maureen Rebecca Smith (1,2,*,+), Maria Trofimova (1,2,*), Ariane Weber (3), Yannick Duport (1,2), Denise Kühnert (3,4), and Max von Kleist (1,2,4,+)

1 Systems Medicine of Infectious Disease (P5), Robert Koch Institute, Berlin, Germany

2 Bioinformatics (MF1), Robert Koch Institute, Berlin, Germany

3 Transmission, Infection, Diversification and Evolution Group, Max-Planck Institute for the Science of Human History, Jena, Germany

4 German COVID Omics Initiative (deCOI)

* these authors contributed equally to this work

+ smithm@rki.de, kleistm@rki.de

ABSTRACT

By May 2021, over 160 million SARS-CoV-2 diagnoses have been reported worldwide. Yet the true number of infections is unknown and believed to exceed the reported numbers several-fold. National testing policies, in particular, can strongly affect the proportion of undetected cases.

Here, we propose a novel method (GInPipe) that reconstructs SARS-CoV-2 incidence profiles within minutes, solely from publicly available, time-stamped viral genomes. We validated GInPipe against in silico generated outbreak data and elaborate phylodynamic analyses. We apply the method to reconstruct incidence histories from sequence data for Denmark, Scotland, Switzerland, and Victoria (Australia). GInPipe reconstructs the different pandemic waves robustly and remarkably accurately. We demonstrate how the method can be used to investigate the effects of changing testing policies on the probability of diagnosing and reporting infected individuals. Specifically, we find that under-reporting was highest in mid-2020 in parts of Europe, coinciding with changes towards more liberal testing policies at times of low testing capacities.

Due to the increased use of real-time sequencing, it is envisaged that GInPipe can complement established surveillance tools to monitor the SARS-CoV-2 pandemic. We anticipate that the method is particularly useful in settings where diagnostic and reporting infrastructures are insufficient. In ‘post-pandemic’ times, when diagnostic efforts are decreased, GInPipe may facilitate the detection of hidden infection dynamics.

Introduction

As of May 2021, the global SARS-CoV-2 pandemic is still ongoing in most parts of the world, with 160 million reported cases worldwide. Novel vaccines of high efficacy have been developed within a year of the outbreak [2, 46]. At the time of writing, approximately 8.2% of the world's population had already received at least one vaccination. However, distribution of vaccines is uneven, and achieving global herd immunity may pose an extremely difficult, long-term task [63, 36]. At the same time, novel variants of concern (VOC) have emerged in high-prevalence regions [6, 34], which may be able to reinfect individuals [21, 37] and escape vaccine-elicited immune responses [33, 66, 45]. For example, Manaus, Brazil, witnessed a massive second wave of infections [51], despite the fact that approximately 80% had already experienced an infection at the onset of the second wave [6]. Because of the evolutionary versatility of SARS-CoV-2 and difficulties in global vaccine distribution, some experts expect that the virus may not be eliminated globally [44]. Even without adaptation to vaccines in the future, it has been postulated that SARS-CoV-2 may resurge [24, 50] and that surveillance may have to be maintained into the mid-2020s to monitor virus spread and evolution [24].

Currently, the gold standard of SARS-CoV-2 surveillance is diagnostic testing via polymerase chain reaction (PCR) or antigen-based rapid diagnostic testing (RDT). Diagnostic test results currently define infection case reports, which are used to survey epidemiological dynamics and to define thresholds for travel bans and non-pharmaceutical measures. Inevitably, case reporting data is affected by test coverage, which changes when testing policies are adapted. While RDT enables point-of-care diagnosis and is less costly than PCR testing [13, 12], gathering and reporting of test results still requires a sophisticated infrastructure, which is difficult to establish and maintain in many developing countries [35]. Independent and complementary sources of information, such as social media reports [31, 53] or wastewater analysis [9, 43], have been used early on to complement our knowledge of the pandemic dynamics. In addition, many regions of the world sequence SARS-CoV-2 genomes to track virus evolution and the emergence of variants of concern. The gathered viral sequences are regularly provided to public databases, such as GISAID [14, 54]. We hypothesize that the genetic data alone holds information about the pandemic trajectory. More specifically, we presume that the speed at which SARS-CoV-2 evolves on the population level contains information about the number of individuals who are actively infected.

In the vast majority of cases, SARS-CoV-2 is transmitted within a very short period, only days after infection [30, 17]. The consequence is a well-defined duration of intra-patient evolutionary time before transmission. Thus, the number of infected individuals is correlated with the rate of divergence of the viral population, implying an ‘evolutionary signal’.

In this article, we introduce the computational pipeline GInPipe, which uses only time-stamped sequencing data, extracts the ‘evolutionary signal’ and reconstructs SARS-CoV-2 incidence histories. The approach builds on recent work by Khatri and Burt [23], who derived a simple function that relates the mean number of mutant origins to the current allele frequency and the mutational input, which is proportional to the effective population size. Herein, due to the short window of transmission, we anticipate that the effective population size may strongly correlate with the incidence of SARS-CoV-2. We adapt the function derived in [23] and embed it into an automatic computational pipeline (GInPipe) that reconstructs the time course of an incidence correlate $\varphi$ merely from SARS-CoV-2 genetic data. GInPipe is validated threefold and performs robustly: (i) against in silico generated outbreak data, (ii) against phylodynamic analysis and (iii) in comparison with case reporting data. We applied the method to SARS-CoV-2 sequencing data from Denmark, Scotland, Switzerland, and the Australian state Victoria to reconstruct their respective incidence histories. Lastly, we utilize the inferred epidemic trajectories to compute changes in the probability that an infected individual is reported and highlight how this probability is affected by changes in testing policies.

Results

Incidence reconstruction

An outline of GInPipe for SARS-CoV-2 incidence reconstruction is shown in Figure 1A-C. After compiling a set of time-stamped, full-length SARS-CoV-2 genomes, the sequences are placed into temporal bins (Fig. 1A). For each bin, we compute the number of mutant sequences $m$, as well as the number of haplotypes $h$. These two inputs are used to infer the incidence correlate $\varphi$ (Fig. 1B). We then smooth over all $\varphi$ point estimates and derive a reconstructed incidence history along the time axis (Fig. 1C). The reconstructed incidence histories can then be used as a basis to estimate the effective reproduction number $R_e$, as well as the relative case detection rate as outlined below.
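The binning and counting steps can be sketched as follows. This is only an illustration under stated assumptions: each sequence is represented by a hypothetical "mutation fingerprint" (the set of its differences from a reference), bins have equal length in days, and haplotypes are counted as distinct fingerprints among the mutant sequences. The function that maps these counts to the incidence correlate is not reproduced here.

```python
from collections import defaultdict
from datetime import date

def bin_by_time(records, start, days_per_bin):
    """Group (collection_date, fingerprint) records into equal-length temporal bins."""
    bins = defaultdict(list)
    for collected, fingerprint in records:
        index = (collected - start).days // days_per_bin
        bins[index].append(fingerprint)
    return dict(sorted(bins.items()))

def bin_summary(fingerprints):
    """Count sequences (n), mutant sequences (m) and distinct haplotypes (h) in one bin."""
    mutants = [f for f in fingerprints if f]  # an empty fingerprint equals the reference
    return {"n": len(fingerprints), "m": len(mutants), "h": len(set(mutants))}

# Hypothetical toy data: fingerprints are frozensets of mutation labels.
records = [
    (date(2020, 3, 1), frozenset()),
    (date(2020, 3, 2), frozenset({"C241T"})),
    (date(2020, 3, 5), frozenset({"C241T"})),
    (date(2020, 3, 6), frozenset({"C241T", "A23403G"})),
    (date(2020, 3, 9), frozenset({"A23403G"})),
]
bins = bin_by_time(records, start=date(2020, 3, 1), days_per_bin=7)
summaries = {index: bin_summary(seqs) for index, seqs in bins.items()}
```

Bins holding a fixed number of sequences (rather than a fixed time span) could be built analogously by slicing the date-sorted records.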

Method validation: in silico experiment

To test whether GInPipe correctly reconstructs incidence histories, we first performed an in silico experiment. We considered a population of $N(t)$ infected individuals at time $t$ that stochastically generate $N(t+1)$ infected individuals in the next time step $t+1$. Each individual is associated with a virus sequence, which can mutate randomly. Individuals can be removed (the associated sequence is removed), or they transmit their virus (the associated virus is copied over). We record the number of infected individuals per generation, as well as all sequences of the currently circulating viruses. We then use the simulated viral sequences to infer $\varphi(t)$ and reconstruct the incidence history, as presented in Figure 1D-E.
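A minimal sketch of such a generation-based simulation is given below. It is not the simulation code used in the study: the Poisson offspring distribution, the per-transmission mutation probability, the infinite-sites style mutation labels and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_outbreak(n0, r_per_generation, mutation_prob, seed=0):
    """Simulate N(t) -> N(t+1); each sequence is a frozenset of mutation ids."""
    rng = random.Random(seed)
    population = [frozenset() for _ in range(n0)]
    sizes = [n0]
    next_site = 0
    for r in r_per_generation:
        offspring = []
        for seq in population:
            # The parent is removed; it leaves Poisson(r) copies of its virus.
            for _ in range(poisson(rng, r)):
                child = seq
                if rng.random() < mutation_prob:
                    child = seq | {next_site}  # new, previously unseen mutation
                    next_site += 1
                offspring.append(child)
        population = offspring
        sizes.append(len(population))
    return sizes, population

# Hypothetical scenario: five growing generations followed by five declining ones.
sizes, final_sequences = simulate_outbreak(50, [1.3] * 5 + [0.8] * 5, 0.3, seed=42)
```

Recording the sequences of every generation, not only the last one, would give the time-stamped input that the reconstruction needs.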

In Figure 1D, we compare one trajectory of simulated population sizes with the reconstructed incidence histories. The simulated outbreak (red line, right axis) consists of two waves of increasing magnitude. GInPipe reconstructs these dynamics (blue lines and dots, left axis) quite accurately, although the incidence correlate $\varphi(t)$ is on a different scale, implying a linear correlation to the number of infected individuals. To assess this correlation, we performed 10 stochastic simulations and compared the $\varphi(t)$ point estimates with the corresponding number of infected individuals (Fig. 1E). We observed a strong ($r = 0.96$) and highly significant ($p < 10^{-16}$) linear relationship between the number of infected individuals $N(t)$ and the method's incidence correlate $\varphi(t)$.
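To make "on a different scale but linearly related" concrete, the following sketch computes a Pearson correlation coefficient and a least-squares line between true sizes and estimates. The numbers are invented and perfectly linear by construction; they are not results from the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def linear_fit(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

# Invented example: the estimate is a rescaled, shifted copy of the true size.
n_true = [100, 200, 400, 800]
phi_est = [12, 22, 42, 82]
r = pearson_r(n_true, phi_est)
slope, intercept = linear_fit(n_true, phi_est)
```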

While these simulations represent idealized scenarios, we also evaluated the robustness of GInPipe with regard to incomplete and sparse data sets, as elaborated in Supplementary Note 1.

Our analyses showed that the method can still accurately reconstruct incidence histories over time when data is missing or when data sampling is unbalanced. In scenarios of extreme under-sampling, the $\varphi$ point estimates are prone to slight underestimation. However, through the smoothing step the reconstructed incidence trajectories still follow the overall population dynamics (Suppl. Note 1, section SN.1.7). Finally, we evaluated whether introductions of foreign sequences affect the reconstruction of incidence histories. Even for extreme and unrealistic cases, a stable reconstruction of the underlying dynamic is possible, but we do observe a slight tendency of overestimation in these extreme cases (Suppl. Note 1, section SN.1.8).

Figure 1. Reconstruction of incidence histories using the proposed method. A–C Schematic of the incidence reconstruction method. A The sequences are chronologically ordered by collection date. The line shows the cumulative sum of sequences over time. The sequences are allocated into temporal bins, spanning either the same time frame $\Delta d$ (yellow and purple bins) or containing the same number of sequences (green bins). B For each bin, the number of distinct variants $h$, as well as the total number of mutant sequences $m$, are used to infer the incidence correlate $\varphi$. C The $\varphi$ point estimates for all bins (dots) are smoothed with a convolution filter. For uncertainty estimation, the point estimates are sub-sampled and interpolated. D–E Reconstruction of a simulated outbreak with GInPipe. D $\varphi$ estimates resemble the underlying population dynamics over time. The blue line shows the smoothed median of the sub-sampled $\varphi$ estimates (dots) for a simulated outbreak. The red line indicates true incidence per generation. E Dot plot showing the true outbreak size from the simulation versus the $\varphi$ point estimates for 10 stochastic simulations. The red line depicts the linear fit.
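The smoothing step named in the caption can be illustrated with a simple moving-average convolution. The uniform kernel, the window length and the renormalization at the series edges are assumptions of this sketch; the caption only specifies that a convolution filter is used.

```python
def smooth(values, window=3):
    """Moving-average convolution with a uniform kernel.

    At the edges the kernel is truncated and renormalized, so the output
    has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# A flat series is unchanged, while an isolated spike is spread over its neighbours.
flat = smooth([1, 1, 1, 1, 1], window=3)
spike = smooth([0, 0, 3, 0, 0], window=3)
```

Repeating the smoothing on sub-sampled point estimates and taking quantiles across repetitions would be one way to obtain an uncertainty band.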

Method validation: phylodynamics

Phylodynamic methods combine phylogeny reconstruction with epidemic models. For example, the piecewise constant birth-death sampling process [55] implemented in BEAST2 [5] allows the reconstruction of the effective reproduction numbers $R_e(\tau)$ for given time periods $\tau$. However, these methods are computationally expensive, so that only moderately sized sequence sets can be used, and advanced knowledge is required to apply them properly to larger data sets.

We conducted phylodynamic analyses of SARS-CoV-2 sequence data from Denmark, Scotland, Switzerland, and the Australian state Victoria. In analyzing the data we assumed that $R_e^{\mathrm{BEAST}}(\tau)$ was piecewise constant between major changes in SARS-CoV-2 non-pharmaceutical interventions (intervals stated in Supplementary Note 2). We then used BEAST2 to estimate $R_e^{\mathrm{BEAST}}(\tau)$ alongside the tree reconstructions.

In parallel, we estimated corresponding effective reproduction numbers $R_e^{\varphi}(t)$ by applying the Wallinga-Teunis method [61] to incidence correlates $\varphi$ derived by GInPipe. For both methods, we used publicly available full-length SARS-CoV-2 sequencing data from GISAID [14, 54] (Supplementary Note 4).
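For readers unfamiliar with the Wallinga-Teunis approach, the sketch below shows its usual formulation for a daily incidence series: each case on day $s$ is attributed to earlier days in proportion to incidence weighted by the generation-interval distribution, and $R(t)$ sums the shares attributed to day $t$. The generation-interval weights and the input series are hypothetical, no correction for right-censoring is applied, and this is not the GInPipe implementation.

```python
def wallinga_teunis(incidence, gen_interval):
    """Case reproduction number per day from a daily incidence series.

    gen_interval[k - 1] is the assumed probability that the generation
    interval equals k days (k >= 1). Estimates for the last days are biased
    downwards because their secondary cases are not yet observed.
    """
    n_days, max_lag = len(incidence), len(gen_interval)
    # Total infection pressure on day s from all earlier days.
    pressure = [
        sum(incidence[s - k] * gen_interval[k - 1]
            for k in range(1, min(max_lag, s) + 1))
        for s in range(n_days)
    ]
    r_values = []
    for t in range(n_days):
        total = 0.0
        for k in range(1, max_lag + 1):
            s = t + k
            if s < n_days and pressure[s] > 0:
                # Expected number of day-s cases infected by one day-t case.
                total += incidence[s] * gen_interval[k - 1] / pressure[s]
        r_values.append(total)
    return r_values

# Hypothetical input: constant incidence should give R close to 1 mid-series.
weights = [0.2, 0.5, 0.3]
r_constant = wallinga_teunis([10] * 12, weights)
```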

Results of both methods are shown in Figure 2. Overall, both methods show congruent trends for the analyzed countries, when comparing the piecewise constant $R_e^{\mathrm{BEAST}}(\tau)$ from phylodynamic analysis with the median daily $R_e^{\varphi}(t)$ for the same interval. Notably, GInPipe allows for a much finer time resolution (daily $R_e^{\varphi}(t)$ estimates) compared to the piecewise constant $R_e^{\mathrm{BEAST}}(\tau)$ estimates on pre-defined intervals obtained from the phylodynamic analysis.

For Denmark, the first interval spans the decline in the number of infections after the first wave (end of April to mid-June). Consequently, we observe $R_e(\tau) < 1$ using both methods. For the next intervals, the median or piecewise constant $R_e(\tau)$ is predicted to be around, or slightly larger than, one. However, GInPipe reconstructs a number of peaks in the daily $R_e^{\varphi}(t)$ estimates, most pronounced in August, coinciding with the summer holidays in Europe. In the interval from November to mid-December the estimates deviate slightly, with a larger median estimate from BEAST2; however, both interval estimates are predicted to be $R_e(t) > 1$ and the confidence intervals overlap entirely.

The $R_e(\tau)$ estimates for Scotland agree almost exactly, where GInPipe again allows for a much finer time resolution. Once again, we see a peak in the summer (August-September 2020), coinciding with the summer holidays in Europe. For the last interval (from December 2020) both methods show a median $R_e(t) > 1$, again with a slightly higher median BEAST2 estimate, coinciding with the second wave of infections.

For Switzerland, the estimates disagree slightly, particularly in the first interval (mid-March to mid-May), which spans both sides of the peak number of infections during the first wave. Although both methods predict a median $R_e(\tau) < 1$, the absolute value differs in magnitude between the two methods, with BEAST2 estimating a much lower value. The lower estimate from the BEAST2 analysis in the first interval may be explained by the approximation of transmission clusters, which results in the reconstruction of a relatively high number of transmission events, many of which may have occurred outside Switzerland (Supplementary Note 2, Figure SN.12 therein, tree B.1). In the daily estimates, we see a transition from $R_e^{\varphi}(t) > 1$ to $R_e^{\varphi}(t) < 1$, which may explain why the median prediction with GInPipe is close to one for the entire interval. The estimates are qualitatively different for the second interval (mid-May to mid-June), where GInPipe estimates $R_e^{\varphi}(\tau) < 1$, while BEAST2 estimates $R_e^{\mathrm{BEAST}}(\tau) \approx 1$. Again, GInPipe estimates a peak in summer (mid-June to mid-August, $R_e^{\varphi}(\tau) > 1$). While BEAST2 predicts the onset of transmission in the second wave to already start in mid-August ($R_e^{\mathrm{BEAST}}(\tau) > 1$), GInPipe estimates the first major rise in infections at the end of September.

For Victoria we observe an $R_e^{\varphi}(t) > 1$ until mid-March in the daily estimates. Overall, $R_e$ is less than 1 for the first interval between mid-March and May, versus $R_e > 1$ between June and August. Again, we see various peaks around June and July in the daily $R_e^{\varphi}(t)$ estimates with the proposed method. For the final interval, the two methods disagree slightly, with $R_e^{\mathrm{BEAST}}(\tau) < 1$ and $R_e^{\varphi}(\tau) > 1$, though the daily $R_e^{\varphi}(t)$ are decreasing towards the end of the final interval.

In terms of computational time, the entire GInPipe analysis pipeline runs in 20 minutes on the full Denmark data set (n = 40,575 sequences) and in 7 minutes on the Victoria data set (n = 10,710 sequences) on a single notebook (2.3 GHz, 2 cores). Furthermore, GInPipe does not require pre-assigning any intervals, excluding particular strains, constructing a phylogenetic tree, or clustering sequences based on their phylogenetic relationship. The BEAST2 analysis alone required about 15 hours on an Intel Xeon E5-2687W (3.1 GHz, 2 x 12 cores) on a sub-sampled data set ($n \approx 2500$ sequences), with additional computation time needed to construct a multiple sequence alignment and approximate transmission clusters.

Reconstructed incidence histories

We used GInPipe to reconstruct complete incidence histories for Denmark, Scotland, Switzerland, and Victoria (Australia) from publicly available full-length SARS-CoV-2 sequencing data provided through GISAID [14, 54] (Supplementary Note 4). In Figure 3, we compare the reconstructed incidence histories (blue lines and dots, left axis) to the 7-day rolling average of officially reported new cases (red line, right axis). Overall, the reconstructed incidence estimates reflect the different pandemic waves deduced from the reporting data, although there are quantitative differences between the reconstructed and reported incidence trajectories over time. In particular, during the first wave in Scotland and Victoria (Fig. 3B,D) our method estimates higher incidences than reported, whereas the curves align at later points for the second and third wave. It is worth mentioning that testing capacities were particularly low in Scotland in April (during the first wave), suggesting extensive under-reporting in the initial phase of the pandemic. This is also supported by test positive rates of almost 40% during April 2020 in Scotland (Supplementary Fig. 1). In Victoria, sufficient testing capacities were not available until May, but test positive rates were already declining from April to May (Supplementary Fig. 1). This indicates that the first wave may have been under-reported in magnitude, but had vanished by May.
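The comparison between reported cases and the reconstructed incidence can be sketched as below. The trailing 7-day window and, in particular, the ratio normalized to its maximum are assumptions made for illustration of a relative case detection measure; they are one plausible reading and not the definition used in the study, and the input numbers are invented.

```python
def rolling_mean(values, window=7):
    """Trailing rolling average; early days use the shorter window available."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged

def relative_detection(reported, incidence_correlate):
    """Ratio of reported cases to the incidence correlate, scaled so the maximum is 1.

    Days with a non-positive incidence correlate are returned as None.
    """
    ratios = [r / p if p > 0 else None
              for r, p in zip(reported, incidence_correlate)]
    peak = max(x for x in ratios if x is not None)
    return [x / peak if x is not None else None for x in ratios]

# Invented example: reported cases stay flat while the correlate doubles twice,
# so the relative detection halves at each step.
reported_avg = rolling_mean([10, 10, 10], window=7)
detection = relative_detection(reported_avg, [100, 200, 400])
```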

Interestingly, the proposed incidence reconstruction method predicts small summer waves in August in the three European countries (Fig. 3A–C) that are not visible in the reporting data. In the incidence reconstruction method these ‘summer waves’


References

MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability

Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia.

Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.

Minimap2: pairwise alignment for nucleotide sequences

BEAST 2: A Software Platform for Bayesian Evolutionary Analysis
