
Rapid incidence estimation from SARS-CoV-2 genomes reveals decreased case detection in Europe during summer 2020

Abstract
As of May 2021, over 160 million SARS-CoV-2 infections have been reported worldwide. Yet, the true number of infections is unknown and believed to exceed the reported numbers by several fold, depending on national testing policies that can strongly affect the proportion of undetected cases. To overcome this testing bias and better assess SARS-CoV-2 transmission dynamics, we propose a genome-based computational pipeline, GInPipe, to reconstruct the SARS-CoV-2 incidence dynamics through time. After validating GInPipe against in silico generated outbreak data, as well as more complex phylodynamic analyses, we use the pipeline to reconstruct incidence histories in Denmark, Scotland, Switzerland, and Victoria (Australia) solely from viral sequence data. The proposed method robustly reconstructs the different pandemic waves in the investigated countries and regions, does not require phylodynamic reconstruction, and can be directly applied to publicly deposited SARS-CoV-2 sequencing data sets. We observe differences in the relative magnitude of reconstructed versus reported incidences during times with sparse availability of diagnostic tests. Using the reconstructed incidence dynamics, we assess how testing policies may have affected the probability of diagnosing and reporting infected individuals. We find that under-reporting was highest in mid 2020 in all analysed countries, coinciding with liberal testing policies at times of low test capacities. Due to the increased use of real-time sequencing, it is envisaged that GInPipe can complement established surveillance tools to monitor the SARS-CoV-2 pandemic and evaluate testing policies. The method executes within minutes on very large data sets and is freely available as a fully automated pipeline from https://github.com/KleistLab/GInPipe.


Maureen Smith
Robert Koch Institute
Maria Trofimova
Robert Koch Institute
Ariane Weber
Max-Planck Institute
Yannick Duport
Robert Koch Institute
Denise Kühnert
Department of Archaeogenetics, Max Planck Institute for the Science of Human History, 07745 Jena,
Germany
https://orcid.org/0000-0002-5657-018X
Max von Kleist ( kleistm@rki.de )
MF1 Bioinformatics, Robert Koch-Institute https://orcid.org/0000-0001-6587-6394
Article
Keywords: SARS-CoV-2, epidemiology, genomes
Posted Date: May 27th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-558667/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License. 
Version of Record: A version of this preprint was published at Nature Communications on October 14th,
2021. See the published version at https://doi.org/10.1038/s41467-021-26267-y.

Maureen Rebecca Smith (1,2,*,+), Maria Trofimova (1,2,*), Ariane Weber (3), Yannick Duport (1,2),
Denise Kühnert (3,4), and Max von Kleist (1,2,4,+)
(1) Systems Medicine of Infectious Disease (P5), Robert Koch Institute, Berlin, Germany
(2) Bioinformatics (MF1), Robert Koch Institute, Berlin, Germany
(3) Transmission, Infection, Diversification and Evolution Group, Max-Planck Institute for the Science of Human
History, Jena, Germany
(4) German COVID Omics Initiative (deCOI)
(*) these authors contributed equally to this work
(+) smithm@rki.de
(+) kleistm@rki.de
ABSTRACT
By May 2021, over 160 million SARS-CoV-2 diagnoses have been reported worldwide. Yet, the true number of infections
is unknown and believed to exceed the reported numbers by several fold. National testing policies, in particular, can strongly
affect the proportion of undetected cases.
Here, we propose a novel method (GInPipe) that reconstructs SARS-CoV-2 incidence profiles within minutes, solely from
publicly available, time-stamped viral genomes. We validated GInPipe against in silico generated outbreak data and elaborate
phylodynamic analyses. We apply the method to reconstruct incidence histories from sequence data for Denmark, Scotland,
Switzerland, and Victoria (Australia). GInPipe reconstructs the different pandemic waves robustly and remarkably accurately. We
demonstrate how the method can be used to investigate the effects of changing testing policies on the probability of diagnosing
and reporting infected individuals. Specifically, we find that under-reporting was highest in mid 2020 in parts of Europe, coinciding
with changes towards more liberal testing policies at times of low testing capacities.
Due to the increased use of real-time sequencing, it is envisaged that GInPipe can complement established surveillance
tools to monitor the SARS-CoV-2 pandemic. We anticipate that the method is particularly useful in settings where diagnostic
and reporting infrastructures are insufficient. In ‘post-pandemic’ times, when diagnostic efforts are decreased, GInPipe may
facilitate the detection of hidden infection dynamics.
Introduction
As of May 2021, the global SARS-CoV-2 pandemic is still ongoing in most parts of the world, with 160 million reported cases
worldwide. Novel vaccines of high efficacy have been developed within a year of the outbreak [2, 46]. At the time of writing,
approximately 8.2% of the world's population had already received at least one vaccination. However, distribution of vaccines is
uneven and achieving global herd immunity may pose an extremely difficult, long-term task [63, 36]. At the same time, novel
variants of concern (VOC) have emerged in high prevalence regions [6, 34], which may be able to reinfect individuals [21, 37]
and escape vaccine elicited immune responses [33, 66, 45]. For example, Manaus, Brazil, witnessed a massive second wave of
infections [51], despite the fact that approx. 80% of the population had already experienced an infection at the onset of the second wave [6].
Because of the evolutionary versatility of SARS-CoV-2 and difficulties in global vaccine distribution, some experts expect that
the virus may not be eliminated globally [44]. Even without adaptation to vaccines in the future, it has been postulated that
SARS-CoV-2 may resurge [24, 50] and surveillance may have to be maintained into the mid 2020s to monitor virus spread and
evolution [24].
Currently, the gold standard of SARS-CoV-2 surveillance is diagnostic testing via polymerase chain reaction (PCR) or antigen-
based rapid diagnostic testing (RDT). Diagnostic test results currently define infection case reports, which are used to survey
epidemiological dynamics and to define thresholds for travel bans and non-pharmaceutical measures. Inevitably, case reporting
data is affected by test coverage, which changes when testing policies are adapted. While RDT enables point-of-care diagnosis
and is less costly than PCR testing [13, 12], gathering and reporting of test results still requires a sophisticated infrastructure,
which is difficult to establish and maintain in many developing countries [35]. Independent and complementary sources of
information, such as social media reports [31, 53] or wastewater analysis [9, 43], have been used early on to complement our
knowledge of the pandemic dynamics. In addition, many regions of the world sequence SARS-CoV-2 genomes to track virus
evolution and the emergence of variants of concern. The gathered viral sequences are regularly provided to public databases,
such as GISAID [14, 54]. We hypothesize that the genetic data alone holds information about the pandemic trajectory. More
specifically, we presume that the speed at which SARS-CoV-2 evolves on the population level contains information about the
number of individuals who are actively infected.
In the vast majority of cases, SARS-CoV-2 is transmitted within a very short period, only days after infection [30, 17]. The
consequence is a well-defined duration of intra-patient evolutionary time before transmission. Thus, the number of infected
individuals is correlated with the rate of divergence of the viral population, implying an ‘evolutionary signal’.
In this article, we introduce the computational pipeline GInPipe, which only uses time-stamped sequencing data, extracts
the ‘evolutionary signal’ and reconstructs SARS-CoV-2 incidence histories. The approach builds on recent work by Khatri and
Burt [23], who derived a simple function that relates the mean number of mutant origins to the current allele frequency and
the mutational input, which is proportional to the effective population size. Herein, due to the short window of transmission,
we anticipate that the effective population size may strongly correlate with the incidence of SARS-CoV-2. We adapt the
function derived in [23] and embed it into an automatic computational pipeline (GInPipe) that reconstructs the time course of an
incidence correlate $\phi$ merely from SARS-CoV-2 genetic data. GInPipe is validated threefold and performs robustly: (i) against
in silico generated outbreak data, (ii) against phylodynamic analysis and (iii) in comparison with case reporting data. We
applied the method to SARS-CoV-2 sequencing data from Denmark, Scotland, Switzerland, and the Australian state Victoria to
reconstruct their respective incidence histories. Lastly, we utilize the inferred epidemic trajectories to compute changes in the
probability that an infected individual is reported and highlight how this probability is affected by changes in testing policies.
Results
Incidence reconstruction
An outline of GInPipe for SARS-CoV-2 incidence reconstruction is shown in Figure 1A-C. After compiling a set of time-stamped,
full-length SARS-CoV-2 genomes, the sequences are placed into temporal bins $b$ (Fig. 1A). For each bin, we compute
the number of mutant sequences $m_b$, as well as the number of haplotypes $h_b$. These two inputs are used to infer the incidence
correlate $\phi_b$ (Fig. 1B). We then smooth over all $\phi_b$ point estimates and derive a reconstructed incidence history along the time
axis (Fig. 1C). The reconstructed incidence histories can then be used as a basis to estimate the effective reproduction number
$R_e$, as well as the relative case detection rate as outlined below.
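The binning and counting step can be illustrated with a minimal sketch. It assumes that each genome has already been reduced to a collection date and a set of mutations relative to a reference, that bins have a fixed width in days, that $m_b$ counts sequences carrying at least one mutation, and that $h_b$ counts distinct mutation profiles among them. The record layout, the toy data, and the function name `bin_statistics` are hypothetical and do not reflect GInPipe's actual interface. The inference of $\phi_b$ from $(m_b, h_b)$ is left out on purpose, because it relies on the function adapted from [23].

```python
from collections import defaultdict
from datetime import date


def bin_statistics(records, start, bin_days):
    """Group (collection_date, mutations) records into fixed-width bins.

    Returns {bin_index: (m_b, h_b)}, where m_b is the number of sequences
    with at least one mutation and h_b is the number of distinct mutation
    profiles among those sequences (an assumed reading of the text).
    """
    bins = defaultdict(list)
    for collected, mutations in records:
        index = (collected - start).days // bin_days
        bins[index].append(frozenset(mutations))
    stats = {}
    for index, profiles in sorted(bins.items()):
        mutants = [p for p in profiles if p]  # empty profile = reference
        stats[index] = (len(mutants), len(set(mutants)))
    return stats


# Toy records, made up for illustration only.
records = [
    (date(2020, 3, 1), {"C241T"}),
    (date(2020, 3, 2), {"C241T"}),
    (date(2020, 3, 3), {"C241T", "A23403G"}),
    (date(2020, 3, 4), set()),
    (date(2020, 3, 9), {"G28881A"}),
]
stats = bin_statistics(records, start=date(2020, 3, 1), bin_days=7)
```

Bins holding an equal number of sequences, the alternative binning mode mentioned for Figure 1A, would only change how `index` is assigned.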
Method validation: in silico experiment
To test whether GInPipe correctly reconstructs incidence histories, we first performed an in silico experiment. We considered
a population of $N(t)$ infected individuals at time $t$ that stochastically generate $N(t+1)$ infected individuals in the next time
step $t+1$. Each individual is associated with a virus sequence, which can mutate randomly. Individuals can be removed (the
associated sequence is removed), or they transmit their virus (the associated virus is copied over). We record the number of
infected individuals per generation, as well as all sequences of the currently circulating viruses. We then use the simulated viral
sequences to infer $\phi(t)$ and reconstruct the incidence history, as presented in Figure 1D-E.
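A toy version of such a generation-based simulation is sketched below. It is not the authors' simulator: the Poisson-distributed offspring and mutation counts, the parameter names, and the representation of a sequence as the set of mutated genome positions are assumptions chosen to keep the example short.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (fine for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_outbreak(n0, r_per_generation, mutation_rate, genome_length, seed=0):
    """Return (sizes, populations): N(t) and the sequences per generation.

    A sequence is a frozenset of mutated positions relative to the founder.
    """
    rng = random.Random(seed)
    population = [frozenset()] * n0
    sizes, populations = [n0], [population]
    for r in r_per_generation:
        offspring = []
        for sequence in population:
            # Zero offspring means removal; each offspring copies the parent.
            for _ in range(poisson(rng, r)):
                child = set(sequence)
                for _ in range(poisson(rng, mutation_rate)):
                    child ^= {rng.randrange(genome_length)}  # toggle one site
                offspring.append(frozenset(child))
        population = offspring
        sizes.append(len(population))
        populations.append(population)
    return sizes, populations


# Growth for five generations, then decline (all values are illustrative).
sizes, populations = simulate_outbreak(
    n0=50, r_per_generation=[1.3] * 5 + [0.8] * 5,
    mutation_rate=0.1, genome_length=29903, seed=1,
)
```

The recorded `sizes` play the role of $N(t)$, and `populations[t]` holds the circulating sequences from which a reconstruction method would have to recover the trajectory.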
In Figure 1D, we compare one trajectory of simulated population sizes with the reconstructed incidence histories. The simulated
outbreak (red line, right axis) consists of two waves of increasing magnitude. GInPipe reconstructs these dynamics (blue lines
and dots, left axis) quite accurately, although the incidence correlate $\phi(t)$ is on a different scale, implying a linear correlation
to the number of infected individuals. To assess this correlation, we performed 10 stochastic simulations and compared the
$\phi(t)$ point estimates with the corresponding number of infected individuals (Fig. 1E). We observed a strong ($r = 0.96$) and
highly significant ($p < 10^{-16}$) linear relationship between the number of infected individuals $N(t)$ and the method’s incidence
correlate $\phi(t)$.
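For readers who want to repeat this kind of check on their own simulations, the following sketch computes a Pearson correlation coefficient and an ordinary least-squares line in plain Python. The numbers are made up for illustration and are unrelated to the values reported above; the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


n_true = [100, 200, 300, 400]   # toy outbreak sizes (made up)
phi_est = [2.1, 3.9, 6.2, 7.8]  # toy incidence correlates (made up)
r = pearson_r(n_true, phi_est)
slope, intercept = least_squares_line(n_true, phi_est)
```

A slope far from one is expected here, since the correlate lives on a different scale than the number of infected individuals.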
While these simulations represent idealized scenarios, we also evaluated the robustness of GInPipe with regard to incomplete and
sparse data sets, as elaborated in Supplementary Note 1.
Our analyses showed that the method can still accurately reconstruct incidence histories over time when data is missing or when
data sampling is unbalanced. In scenarios of extreme under-sampling, the $\phi$ point estimates are prone to slight underestimation.
However, through the smoothing step the reconstructed incidence trajectories still follow the overall population dynamics
(Suppl. Note 1, section SN.1.7). Finally, we evaluated whether introductions of foreign sequences affect the reconstruction of
incidence histories. Even for extreme and unrealistic cases, a stable reconstruction of the underlying dynamic is possible, but
we do observe a slight tendency of overestimation in these extreme cases (Suppl. Note 1, section SN.1.8).
[Figure 1 panels: A Re-sampling of sequence sets; B Determination of $\phi_b$; C Smoothing; D Reconstruction of a simulated outbreak;
E Simulated incidence vs. reconstruction. Axis labels: number of generations, $\phi_{est}$, $N_{true}$.]
Figure 1. Reconstruction of incidence histories using the proposed method. A–C Schematic of the incidence reconstruction method. A
The sequences are chronologically ordered by collection date. The line shows the cumulative sum of sequences over time. The sequences are
allocated into temporal bins, spanning either the same time frame $d_b$ (yellow and purple bins) or containing the same number of sequences
(green bins). B For each bin, the number of distinct variants $h_b$, as well as the total number of mutant sequences $m_b$, are used to infer the
incidence correlate $\phi_b$. C The point estimates for all bins $\phi_b$ (dots) are smoothed with a convolution filter. For uncertainty estimation, the
point estimates are sub-sampled and interpolated. D–E Reconstruction of a simulated outbreak with GInPipe. D $\phi$ estimates resemble the
underlying population dynamics over time. The blue line shows the smoothed median of the sub-sampled $\phi$ estimates (dots) for a simulated
outbreak. The red line indicates true incidence per generation. E Dot plot showing the true outbreak size from the simulation $N_{true}$
versus the $\phi_b$ point estimates for 10 stochastic simulations. The red line depicts the linear fit.
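The smoothing step that the robustness results rely on can be sketched as a convolution of the point estimates with a short kernel. The source does not state the kernel shape or width, so the uniform kernel, the window of three bins, and the renormalisation at the edges are assumptions of this example; the sub-sampling used for uncertainty estimation is not shown.

```python
def smooth(values, window):
    """Convolve a series with a uniform kernel of odd width `window`.

    Near the edges the kernel is truncated and renormalised, so the output
    has the same length as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Toy point estimates with one outlier bin (made up).
phi_points = [2.0, 2.5, 9.0, 3.0, 3.5]
phi_smooth = smooth(phi_points, window=3)
```

A wider window suppresses single-bin outliers more strongly at the cost of blurring short waves, which is the usual trade-off when choosing the kernel.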
Method validation: phylodynamics
Phylodynamic methods combine phylogeny reconstruction with epidemic models. For example, the piecewise constant
birth-death sampling process [55], implemented in BEAST2 [5], allows the reconstruction of the effective reproduction numbers
$R_e(\tau)$ for given time periods $\tau$. However, these methods are computationally expensive, so that only moderately sized sequence
sets can be used, and advanced knowledge is required to apply them properly to larger data sets.
We conducted phylodynamic analyses of SARS-CoV-2 sequence data from Denmark, Scotland, Switzerland, and the Australian
state Victoria. In analyzing the data we assumed that $R_e^{\mathrm{BEAST}}(\tau)$ was piecewise constant in between major changes in
SARS-CoV-2 non-pharmaceutical interventions (intervals stated in Supplementary Note 2). We then used BEAST2 to estimate
$R_e^{\mathrm{BEAST}}(\tau)$ alongside the tree reconstructions.
In parallel, we estimated corresponding effective reproduction numbers $R_e^{\phi}(t)$ by applying the Wallinga-Teunis method [61] to
incidence correlates $\phi$ derived by GInPipe. For both methods, we used publicly available full length SARS-CoV-2 sequencing
data from GISAID [14, 54] (Supplementary Note 4).
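To make the second route concrete, here is a minimal sketch of a Wallinga-Teunis style estimator for an aggregated, incidence-like time series. The discrete generation-interval distribution and the toy series are assumptions of this example and are not the values used in the paper; the function name is hypothetical.

```python
def wallinga_teunis(incidence, generation_pmf):
    """Cohort reproduction numbers R(t) from an incidence-like series.

    generation_pmf[k - 1] is the probability that the generation interval
    equals k time steps (k >= 1).
    """
    n = len(incidence)
    max_lag = len(generation_pmf)
    # Infection pressure on time step s exerted by all earlier time steps.
    pressure = [
        sum(incidence[s - k] * generation_pmf[k - 1]
            for k in range(1, min(s, max_lag) + 1))
        for s in range(n)
    ]
    r = []
    for t in range(n):
        total = 0.0
        for k in range(1, max_lag + 1):
            s = t + k
            if s < n and pressure[s] > 0:
                # Expected number of cases at s attributed to one case at t.
                total += incidence[s] * generation_pmf[k - 1] / pressure[s]
        r.append(total)
    return r


generation_pmf = [0.2, 0.5, 0.3]      # assumed toy distribution over 1-3 steps
phi = [2 ** i for i in range(12)]     # toy series doubling every step
r_daily = wallinga_teunis(phi, generation_pmf)
```

Because only ratios of incidence values enter the estimate, multiplying the whole series by a constant leaves $R(t)$ unchanged, which is why a proportional correlate can stand in for absolute case counts here. The last few values are biased downwards, since the offspring of recent cases have not yet been observed.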
Results of both methods are shown in Figure 2. Overall, both methods show congruent trends for the analyzed countries,
when comparing the piecewise constant $R_e^{\mathrm{BEAST}}(\tau)$ from phylodynamic analysis with the median daily $R_e^{\phi}(t)$ for the same
interval. Notably, GInPipe allows for a much finer time-resolution (daily $R_e$ estimates) compared to the piecewise constant
$R_e$ estimates on pre-defined intervals, obtained from the phylodynamic analysis.
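Comparing a daily series with piecewise constant estimates requires collapsing the daily values per interval. A small sketch of that reduction follows; the breakpoints and values are made up and the function name is hypothetical.

```python
from statistics import median


def interval_medians(daily_values, breakpoints):
    """Median of a daily series within consecutive [start, end) index ranges.

    `breakpoints` must be strictly increasing indices inside the series.
    """
    edges = [0, *breakpoints, len(daily_values)]
    return [median(daily_values[a:b]) for a, b in zip(edges, edges[1:])]


daily_r = [0.8, 0.9, 1.0, 1.2, 1.4, 1.1]  # toy daily estimates (made up)
medians = interval_medians(daily_r, breakpoints=[3])
```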
For Denmark, the first interval spans the decline in the number of infections after the first wave (end of April to mid June).
Consequently, we observe $R_e(\tau) < 1$ using both methods. For the next intervals, the median or piecewise constant $R_e(\tau)$
is predicted to be around, or slightly larger than, one. However, GInPipe reconstructs a number of peaks in the daily $R_e^{\phi}(t)$
estimates, most pronounced in August, coinciding with the summer holidays in Europe. In the interval from November to mid
December the estimates deviate slightly, with a larger median estimate from BEAST2. However, both interval estimates are
predicted to be $R_e(t) > 1$ and the confidence intervals overlap entirely.
The $R_e(\tau)$ estimates for Scotland agree almost exactly, where GInPipe again allows for a much finer time-resolution. Once
again, we see a peak in the summer (August-September 2020), coinciding with the summer holidays in Europe. For the last
interval (from December 2020) both methods show a median $R_e(t) > 1$, again with a slightly higher median BEAST2 estimate,
coinciding with the second wave of infections.
For Switzerland, the estimates disagree slightly, particularly in the first interval (mid March to mid May), which spans both
sides of the peak number of infections during the first wave. Although both methods predict a median $R_e(\tau) < 1$, the absolute
value differs in magnitude between the two methods, with BEAST2 estimating a much lower value. The lower estimate from
the BEAST2-analysis in the first interval may be explained by the approximation of transmission clusters, which results in
the reconstruction of a relatively high number of transmission events, many of which may have occurred outside Switzerland
(Supplementary Note 2, Figure SN.12 therein, tree B.1). In the daily estimates, we see a transition from $R_e^{\phi}(t) > 1$ to $R_e^{\phi}(t) < 1$,
which may explain why the median prediction with GInPipe is close to one for the entire interval. The estimates are qualitatively
different for the second interval (mid May to mid June), where GInPipe estimates $R_e^{\phi}(\tau) < 1$, while BEAST2 estimates
$R_e^{\mathrm{BEAST}}(\tau) \approx 1$. Again, GInPipe estimates a peak in summer (mid June to mid August, $R_e^{\phi}(\tau) > 1$). While BEAST2 predicts the
onset of transmission in the second wave to already start in mid August ($R_e(\tau) > 1$), GInPipe estimates the first major rise in
infections at the end of September.
For Victoria we observe an $R_e^{\phi}(t) > 1$ until mid March in the daily estimates. Overall, $R_e$ is less than 1 for the first interval
between mid March and May, versus $R_e > 1$ between June and August. Again, we see various peaks around June and July in
the daily $R_e$ estimates with the proposed method. For the final interval, both methods slightly disagree, with $R_e^{\mathrm{BEAST}} < 1$ and
$R_e^{\phi}(\tau) > 1$, though the daily $R_e^{\phi}(t)$ are decreasing towards the end of the final interval.
In terms of computational time, the entire GInPipe analysis pipeline runs in 20 minutes on the full Denmark data set (n =
40,575 sequences) and in 7 minutes on the Victoria data set (n = 10,710 sequences) on a single notebook (2.3 GHz, 2 cores).
Furthermore, GInPipe does not require pre-assigning any intervals, excluding particular strains, constructing a phylogenetic tree,
or clustering sequences based on their phylogenetic relationship. The BEAST2 analysis alone required about 15 hours on an
Intel Xeon E5-2687W (3.1 GHz, 2 x 12 cores) on a sub-sampled data set ($n \approx 2500$ sequences), with additional computation
time needed to construct a multiple sequence alignment and approximate transmission clusters.
Reconstructed incidence histories
We used GInPipe to reconstruct complete incidence histories for Denmark, Scotland, Switzerland, and Victoria (Australia)
from publicly available full length SARS-CoV-2 sequencing data provided through GISAID [14, 54] (Supplementary Note 4).
In Figure 3, we compare the reconstructed incidence histories (blue lines and dots, left axis) to the 7-day rolling average of
officially reported new cases (red line, right axis). Overall, the reconstructed incidence estimates reflect the different pandemic
waves deduced from the reporting data, although there are quantitative differences between the reconstructed and reported
incidence trajectories over time. In particular, during the first wave in Scotland and Victoria (Fig. 3B,D), our method estimates
higher incidences than reported, whereas the curves align at later points for the second and third wave. It is worth mentioning
that testing capacities were particularly low in Scotland in April (during the first wave), suggesting extensive under-reporting in
the initial phase of the pandemic. This is also supported by test positive rates of almost 40% during April 2020 in Scotland
(Supplementary Fig. 1). In Victoria, sufficient testing capacities were not available until May, but test positive rates were
already declining from April to May (Supplementary Fig. 1). This indicates that the first wave may have been under-reported in
magnitude, but had vanished by May.
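The comparison between reported and reconstructed incidence can be made explicit with a simple ratio. The sketch below is not the authors' estimator: it assumes that relative case detection is approximated by dividing the smoothed reported cases by the incidence correlate and scaling the result so that a chosen reference period averages one. All names and numbers are hypothetical.

```python
def rolling_mean(values, window=7):
    """Trailing rolling mean; early entries average what is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def relative_detection(reported, phi, reference):
    """Reported cases per unit of phi, scaled to 1 over the reference slice.

    Assumes phi is strictly positive wherever it is used.
    """
    ratio = [r / p for r, p in zip(reported, phi)]
    ref = ratio[reference]
    scale = sum(ref) / len(ref)
    return [x / scale for x in ratio]


reported = [10, 10, 5, 5]   # toy smoothed case counts (made up)
phi = [2.0, 2.0, 2.0, 2.0]  # toy incidence correlates (made up)
detection = relative_detection(reported, phi, reference=slice(0, 2))
```

In this toy series the reconstructed incidence stays flat while reported cases halve, so the relative detection drops to one half of its reference level.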
Interestingly, the proposed incidence reconstruction method predicts small summer waves in August in the three European
countries (Fig. 3A–C) that are not visible in the reporting data. In the incidence reconstruction method these ‘summer waves’
Frequently Asked Questions (12)
Q1. What are the contributions in "Rapid incidence estimation from SARS-CoV-2 genomes reveals decreased case detection in Europe during summer 2020"?

Yet, the true number of infections is unknown and believed to exceed the reported numbers by several fold. Here, the authors propose a novel method (GInPipe) that reconstructs SARS-CoV-2 incidence profiles within minutes, solely from publicly available, time-stamped viral genomes. The authors demonstrate how the method can be used to investigate the effects of changing testing policies on the probability to diagnose and report infected individuals. Specifically, the authors find that under-reporting was highest in mid 2020 in parts of Europe, coinciding with changes towards more liberal testing policies at times of low testing capacities. The authors anticipate that the method is particularly useful in settings where diagnostic and reporting infrastructures are insufficient.

GInPipe is validated threefold and performs robustly: (i) against in silico generated outbreak data, (ii) against phylodynamic analysis and (iii) in comparison with case reporting data. 

While RDT enables point-of-care diagnosis and is less costly than PCR testing [13, 12], gathering and reporting of test results still requires a sophisticated infrastructure, which is difficult to establish and maintain in many developing countries [35]. 

Phylodynamic analyses were performed on subsampled sets of the data described above (Data and data pre-processing) using a birth-death-sampling process as implemented in the BDSKY [55] model in BEAST2 [5]. 

The sampling proportion $s(t) = \psi(t)/(\psi(t) + \mu(t))$ was a priori assumed to arise from a uniform distribution with a lower limit of zero and the upper limit determined by the ratio of analyzed sequences over diagnosed cases, $s \sim U(0, q_i/d_i)$, where $d_i$ is the number of diagnoses and $q_i$ the number of sequences included in the analysis in interval $i$.

The authors used GInPipe to reconstruct complete incidence histories for Denmark, Scotland, Switzerland, and Victoria (Australia) from publicly available full length SARS-CoV-2 sequencing data provided through GISAID [14, 54] (Supplementary Note 4). 

In parallel, the authors estimated corresponding effective reproduction numbers R φ e (t) by applying the Wallinga-Teunis method [61] to incidence correlates φ derived by GInPipe. 

The reconstructed incidence histories can then be used as a basis to estimate the effective reproduction number Re, as well as the relative case detection rate as outlined below. 

4. Interestingly, their method predicts a decline in case detection in Switzerland after the broad introduction of antigen self-testing in November 2020. 

testing capacities were further expanded, especially in the health sector, including hospital patients, health and social care staff, with fairly stable case detection rates. 

A fully automated workflow has been generated using Snakemake [26] and is available from https://github.com/KleistLab/GInPipe. To test the proposed incidence reconstruction method, the authors stochastically simulated the evolutionary dynamics of a viral outbreak using a Poisson process formalism.

compared to the fairly stable case detection levels from mid March to mid May, this policy change leads to a 2-3 fold drop in case detection in the summer months from July-September.