
RESEARCH ARTICLE    Open Access
Can mental health diagnoses in administrative data be used for research? A systematic review of the accuracy of routinely collected diagnoses

Katrina A. S. Davis¹, Cathie L. M. Sudlow² and Matthew Hotopf¹,³*
Abstract
Background: There is increasing availability of data derived from diagnoses made routinely in mental health care, and interest in using these for research. Such data will be subject to both diagnostic (clinical) error and administrative error, and so it is necessary to evaluate their accuracy against a reference standard. Our aim was to review studies where this had been done, to guide the use of other available data.
Methods: We searched PubMed and EMBASE for studies comparing routinely collected mental health diagnosis data to a reference standard. We produced diagnostic category-specific positive predictive values (PPV) and Cohen's kappa for each study.
Results: We found 39 eligible studies. Studies were heterogeneous in design, with a wide range of outcomes. Administrative error was small compared to diagnostic error. PPV was related to the base rate of the respective condition, with an overall median of 76 %. Kappa results on average showed moderate agreement between source data and reference standard for most diagnostic categories (median kappa = 0.45–0.55); anxiety disorders and schizoaffective disorder showed poorer agreement. There was no significant benefit in accuracy for diagnoses made in inpatients.
Conclusions: The current evidence partly answered our questions. There was wide variation in the quality of source data, with a risk of publication bias. For some diagnoses, especially psychotic categories, administrative data were generally predictive of the true diagnosis. For others, such as anxiety disorders, the data were less satisfactory. We discuss the implications of our findings, and the need for researchers to validate routine diagnostic data.
Keywords: Psychiatry, Diagnosis, Population research, Administrative data, Electronic health records, Case registers, Hospital episode statistics
Background
Databases such as those produced by electronic health records, or for the reimbursement of medical costs, contain routinely collected data on diagnosis that have considerable application in research, such as ascertaining outcomes in epidemiology or identifying suitable participants for clinical trials [1–3]. There has been a long history of using routine data in mental health research, from the earliest studies of asylum records through to the case registers of the 20th century [4]. The ready availability of large volumes of data on patients with mental health diagnoses from routine clinical practice following the shift to electronic health records can be utilised for research [5, 6], and such electronically produced administrative data have been used by a diverse range of groups, using routinely collected diagnoses to identify cases of mental illness for public health and advocacy [7–10].
* Correspondence: matthew.hotopf@kcl.ac.uk
¹ Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK
³ Department of Psychological Medicine and SLaM/IoPPN BRC, King's College London, PO62, Weston Education Centre, Cutcombe Road, London SE5 9RJ, UK
Full list of author information is available at the end of the article
Davis et al. BMC Psychiatry (2016) 16:263. DOI: 10.1186/s12888-016-0963-x

Biobanks may also link to administrative databases, connecting genomic, physiological and self-report data with hospital episodes and death registration to become powerful tools for gaining insight into risk and protective factors for a wide range of diseases. UK Biobank recruited 500,000 people aged between 40 and 69 years in 2006–2010 from across the UK [11], and its data linkage includes Hospital Episode Statistics (HES) in England, and the equivalent datasets in Scotland and Wales, which log every hospital admission, including to psychiatric hospitals, and include ICD-10 diagnosis codes (the WHO's International Classification of Diseases) [12]. Such linkages provide a means of greatly enriching UK Biobank's outcomes in a cost-effective and scalable manner, and offer the opportunity to identify cases of psychiatric illness through ICD-10 codes from HES and other records. Similar data linkages are in place for other large studies [4].
Despite the promise of data linkage, there are inevitably concerns that routine data, collected for non-research purposes, may be prone to misclassification. Accuracy can be affected by errors at a number of points, broadly described as diagnostic error and administrative error. Diagnostic error occurs when the clinician fails to find the signs/symptoms of the correct condition, makes a diagnosis not supported by research criteria, or records a diagnosis at odds with their real conclusion. Administrative error involves issues around turning the physician's diagnosis into codes (ICD in the case of Hospital Episode Statistics), and submitting these codes attached to the correct record and identifiers. Coding traditionally relied on trained non-clinical administrators interpreting the treating clinician's handwritten records to derive a valid ICD code for the record [13], which is inevitably error-prone. Even in the age of electronic health records, where clinicians generally assign diagnosis codes themselves, data entry error and miscoding still occur [5, 14, 15].
Recent reviews of the accuracy of English HES data have mainly concentrated on administrative error [1, 16, 17], and there is a lack of specific information on diagnostic accuracy for psychiatric disorders. In mental health there may be particular issues with diagnostic error, which would be reflected in evaluations of the quality of psychiatric diagnoses in other data sources [15, 18]. Such evaluations may help when considering using HES and other administrative databases to identify cases of mental illness. A previous attempt to collate results from a variety of psychiatric databases, by Byrne et al. from King's College London in 2005, identified that papers were mostly of poor quality, and that the results were too variable to give an overall view on diagnostic validity [19].
The aim of the present systematic review was to identify and collate results regarding the accuracy of diagnosis in routinely collected data from mental health settings, to guide the interpretation of such data when used to identify cases. Specifically, our objectives were to evaluate the agreement and validity of a routinely recorded diagnosis compared with a reference diagnosis for psychiatric disorders (i) in general, (ii) for different psychiatric diagnoses, and (iii) comparing diagnoses made as inpatients with those made as outpatients.
Methods
We used Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to develop the design, conduct and reporting of this review. One author (KD) carried out the search and extracted data.
Search strategy
We searched Medline (PubMed) and Embase from 1980 to November 2014 for studies assessing the accuracy of routinely collected data regarding psychiatric diagnosis against a reference standard diagnosis. We used a combination of medical subject heading and text word terms for mental health; accuracy, reliability and validity; diagnosis, ICD and DSM; and medical records, coding or registers. We reviewed the bibliographies of included publications and used Google Scholar to identify citing papers for additional relevant reviews or studies (see Additional files 1 and 2: Figures S1 and S2 for details of the search strategy).
Eligibility criteria
Studies were included if they were a peer-reviewed published comparison of psychiatric diagnoses in routinely recorded data against reference standard diagnoses using ICD, DSM or similar psychiatric classification systems. The included studies sampled patients recruited from population, primary care or secondary care settings; however, the diagnoses under study were those derived from secondary care only, either inpatient or outpatient psychiatric services. The data being examined (source diagnosis) could be taken from official clinical documentation [clinical] or from a research or administrative database. Where a clinical source diagnosis was used, the comparison data (reference diagnosis) had to be a research diagnosis [research] to look at diagnostic error, but where a database source diagnosis was used, clinical documentation [chart] could also serve as a reference diagnosis to look at administrative error. Comparing a database source diagnosis and a research reference diagnosis gives clinical and administrative error combined. Research diagnoses could be considered reference diagnoses whether they used structured casenote review and/or research interview to reach the diagnosis, as long as they conformed to Spitzer's Longitudinal, Expert and All Data (LEAD) diagnostic approach [20].

Studies were reviewed for inclusion by KD and, where there was doubt, discussed with MH.
We assessed each eligible paper for quality using an established checklist [21] which marks studies on aims (3 marks), method (9 marks) and results & discussion (10 marks). There were no suggested cut-off points for this checklist, so we defined criteria for inadequate, poor, moderate and good quality using total and category-specific scores. Studies considered inadequate were those which scored less than two in any category or less than ten overall. Studies were considered good quality if they scored at least 75 % of the points in each section (see Additional file 3: Table S1 for the quality rating of individual papers).
Data extraction
We devised a form to extract information from each study, which included (1) the nature of the cohort studied, including clinical setting, selection criteria, location, sample size and age range; (2) the source of routine diagnostic data; (3) the nature of the reference diagnosis, how it was derived, and any measures of reliability for this diagnosis; (4) the diagnosis, diagnostic grouping or diagnoses under study, and the diagnostic system used (e.g. ICD/DSM); (5) the base rate for each diagnosis studied (i.e. the prevalence, according to the reference diagnosis, in the setting where the diagnosis was made); and (6) measures of concordance between diagnostic data and reference diagnosis: validity measures (sensitivity, specificity, positive and negative predictive values) and agreement measures (percentage agreement, Cohen's kappa and area under the curve).
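For illustration only, the extraction form can be modelled as a simple record type; the field names below are our shorthand for the six items listed above, not the authors' actual template:

```python
# Hypothetical data structure mirroring the six-item extraction form.
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    cohort: str                  # (1) setting, selection criteria, location, size, ages
    source_of_routine_data: str  # (2) clinical notes, research or administrative database
    reference_diagnosis: str     # (3) how derived, plus any reliability measure
    diagnoses_studied: list[str] = field(default_factory=list)   # (4) with ICD/DSM system
    base_rate: float | None = None                               # (5) prevalence per reference diagnosis
    concordance: dict[str, float] = field(default_factory=dict)  # (6) PPV, kappa, etc.
```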
Data analysis
After consideration of the data available from the papers, and given our aim to assess the accuracy of case finding using routine diagnosis, we chose two parameters to report: (1) positive predictive value (PPV), which provides an estimate of the probability that a given diagnosis in the source data will match the reference diagnosis acting as gold standard; and (2) Cohen's kappa, which provides a measure of agreement between the source data and the reference comparison. Sensitivity and negative predictive value are useful for considering representativeness and the recruitment of controls, but they are of most use in true population studies, where unidentified cases can be found, rather than the secondary care studies identified here.

We give diagnosis-specific results at chapter level (e.g. affective disorders) and disorder level (e.g. bipolar affective disorder) according to the reporting in the original papers. Some papers report at both chapter level and disorder level, in which case the results for the disorder will be a subset of the results for the chapter. Otherwise, we treated results within the same study as independent for data analysis purposes.
Using cross-tabulations provided in the source paper, or working back from accuracy statistics, a 2x2 table was constructed of true positives, false positives, false negatives and true negatives for each diagnosis studied in each paper. From this, the PPV and percentage agreement were calculated. It was thus possible to calculate a PPV for all of the specific outcome categories, even where not originally reported, with 95 % confidence intervals calculated using Wilson's method [22].

Cohen's kappa was calculated from the observed and expected agreement [23]. Two difficulties were encountered: (i) where no one without the diagnosis in the source data was studied, kappa could not be calculated; (ii) where agreement was worse than chance, a negative kappa resulted; since the magnitude of a negative kappa is uninformative, this was regarded as zero.
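To make the computation concrete, the sketch below reconstructs these statistics from a 2x2 table. It is a minimal illustration under the rules just described (Wilson interval for the PPV, kappa from observed and expected agreement, negative kappa set to zero), not the spreadsheet actually used; the example counts are invented:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95 % by default)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

def accuracy_stats(tp: int, fp: int, fn: int, tn: int) -> dict:
    """PPV, percentage agreement and Cohen's kappa from a 2x2 table."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Expected chance agreement, from the marginal totals of the table
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return {"ppv": tp / (tp + fp), "ppv_ci": wilson_ci(tp, tp + fp),
            "agreement": observed,
            "kappa": max(kappa, 0.0)}  # negative kappa treated as zero

print(accuracy_stats(tp=76, fp=24, fn=20, tn=80))  # kappa = 0.56 here
```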
We did not undertake formal meta-analysis or meta-regression due to the heterogeneity between studies in their methods, participant characteristics and reporting. We used non-parametric tests (Kruskal-Wallis H with Bonferroni correction) to assess for independence of groups by data source, and to explore the setting of diagnosis. Calculations and graphs were performed using Microsoft Excel 2013 with the Real Statistics plug-in [24].
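An equivalent group comparison in Python would look like the sketch below; the analysis itself was run in Excel, and the kappa values here are invented placeholders rather than results from the review:

```python
# Kruskal-Wallis H across groups, with Bonferroni-corrected pairwise
# Mann-Whitney follow-ups; data are hypothetical.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

groups = {
    "administrative": [0.73, 0.80, 0.68, 0.91],
    "diagnostic":     [0.48, 0.52, 0.39, 0.55],
    "combined":       [0.36, 0.41, 0.30, 0.44],
}

h, p = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.3f}")

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)  # Bonferroni correction
for a, b in pairs:
    _, p_pair = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
    print(f"{a} vs {b}: p = {p_pair:.3f} (significant at {alpha:.3f}: {p_pair < alpha})")
```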
Results
Papers
Figure 1 shows the PRISMA flow chart for the review. Our literature search identified 117 potential publications. Of these, 72 were excluded, and a further six were found to be of inadequate quality, leaving a total of 39 [25–63]. The excluded papers and reasons for exclusion are given in Additional file 4: Table S2.

Included studies are described in Additional file 5: Table S3. The publications were predominantly Scandinavian (n = 22) and from the USA (n = 10), with the four largest studies coming from Canada. They were published between 1988 and 2014, although they reflect diagnoses made up to 20 years before publication. Many had been published with a view to using the source data for further research.
Study design
Cohorts ranged from samples of the general population to inpatients with a specified working diagnosis. The prevalence of specified diagnoses in secondary care (base rates) varied widely. The number of diagnostic categories examined in each study varied between one and eight. In all, 16 diagnostic categories were considered. In the 39 papers studied, there were 104 diagnosis-specific results. The most common diagnosis studied was schizophrenia (n = 19), followed by bipolar affective disorder (n = 12) and unipolar depression (n = 12). Ten results showed the overall agreement across a number of diagnoses. A number of studies used the category of "schizophrenia spectrum" (n = 13) to describe a group of psychotic disorders usually including schizophrenia, schizotypal disorder and schizoaffective disorder, but varying on the inclusion of other schizophreniform psychoses and delusional disorders. Since the studies compared like for like in their routine and reference diagnoses, we used the term "schizophrenia spectrum" whenever a group of non-affective psychoses including schizophrenia was studied, without further differentiation.
The source data were derived directly from clinical notes in 13 studies (57 diagnosis-specific results), while 26 studies used databases: 17 used regional and national research databases, and nine larger studies used databases created primarily for administrative purposes. The reference diagnosis was the chart diagnosis in four studies, and was otherwise a research diagnosis. Research diagnoses consisted of a notes review in 15 studies, an interview in five, and an interview with notes review in 15. Thirteen studies used more than one researcher reaching a diagnosis independently and reported the inter-rater reliability of the research diagnosis. In 11 cases, this could be compared with the kappa agreement between source and reference [33, 34, 36, 38, 44–46, 50, 57, 59, 61].
There are three groups of results: those using a database diagnosis as the source and chart diagnosis as the reference, giving administrative error only (six results from four papers); those using clinical diagnosis as the source with research diagnosis as the reference, giving diagnostic error only (57 results from 13 papers); and those using database diagnosis as the source with research diagnosis as the reference, giving administrative and diagnostic error combined (41 results from 22 papers).
Twenty-four studies examined diagnoses made as an inpatient, while 13 included diagnoses recorded as in- or outpatients, and two exclusively examined data from outpatients. Eight studies concentrated on diagnoses made at first presentation. Two studies [40, 55] specified that more than one entry stating the diagnosis was required for inclusion in the cohort, and a further two [25, 43] selected inpatients with one diagnosis, but outpatients only if they had two. Multiple instances of diagnosis were the norm in the remainder of the studies, except those of first episode, with various algorithms for treating differing diagnoses: "at least one", "last", "most often" and using a formal hierarchy. Where multiple results were given, the result using the last diagnosis was chosen for this analysis, as this had been shown to be a good method [54] and was thought to be most similar to studies where no choice of results was given.
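As an illustration of the "last diagnosis" rule, the sketch below keeps only each patient's most recent recorded diagnosis; the column names and diagnosis codes are hypothetical:

```python
# Select the last recorded diagnosis per patient (illustrative data).
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2001-03-01", "2003-07-15",
                            "2000-01-10", "2001-05-20", "2004-11-02"]),
    "diagnosis": ["F32", "F31", "F20", "F25", "F20"],
})

last = (records.sort_values("date")
               .groupby("patient_id", as_index=False)
               .last())
print(last)  # one row per patient, holding the most recent diagnosis
```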
The source data were coded using systems from DSM versions III, III-R and IV, or ICD versions 7–10, or local codes based on these classifications (e.g. a Canadian version of ICD-10, or codes specific to Veterans Affairs). Frequently the administrative diagnoses covered a long time frame, and therefore mixtures of editions were used. For example, McConville collected data from 1962 to 1996, covering ICD versions 7, 8, 9 and 10 [45].
Fig. 1 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of the systematic review

Outcomes
There was a wide range of PPVs, from 10 to 100 %, with an overall median of 76 % and a negative skew. PPV is connected mathematically to the base rate of the condition, and simple linear regression confirmed a moderate positive association between PPV and prevalence (r² = 0.27, p < 0.01, regression coefficient β = 0.40). Kappa was calculated for the 29 studies where a true negative rate was known, giving 91 diagnosis-specific results. Agreement using kappa ranged from <0 to 1 (i.e., from worse-than-chance agreement to a perfect match), and the distribution was fairly symmetrical. The median kappa was 0.49, a value classed as moderate inter-rater agreement [64]. In contrast with PPV, there was no correlation with prevalence (r² = 0.0032, p = 0.97). Due to the dependence of PPV on prevalence, kappa would be the preferred statistic when comparing between data sources with different prevalences. The kappa values can also be compared against the inter-rater reliability of the research diagnoses in eleven of the papers. In all cases the kappa result showed greater discordance for the source data than between researchers: kappa for the research diagnosis was 0.71–1, being between 1.2 and 3 times higher (median 1.7) than the results for the source data. This suggests that the studies are demonstrating more than just the reliability of the diagnostic codes.
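The mathematical connection between PPV and base rate noted above is a standard identity (not derived in the paper): for a source diagnosis with sensitivity Se and specificity Sp applied at prevalence p, Bayes' theorem gives

```latex
% PPV as a function of prevalence p, sensitivity Se and specificity Sp
\mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)\,(1 - p)}
```

so, holding Se and Sp fixed, PPV rises monotonically with p, whereas kappa has no such direct dependence.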
The median PPV and kappa results for the administrative error group were 91 % and 0.73 respectively; for the diagnostic error group, 74 % and 0.48; and for the combined error group, 77 % and 0.36. Kruskal-Wallis pairwise testing confirmed that kappa was higher for the administrative error group versus the combined group (p = 0.006), and not significantly different for the diagnostic error versus the combined group (p = 0.33). The significantly higher kappa agreement for the administrative-error-only group suggests that the error in diagnostic data overall occurs mainly at the clinical rather than the administrative stage. A few papers were able to comment directly on the relative contribution of clinical versus administrative error. Moilanen et al. [46] and Makikyro et al. [44] agreed, with clinical errors greatly outnumbering administrative ones (55 vs 9 and 16 vs 2 respectively), although Uggerby et al. [60] was at odds, with seven clinical vs 13 administrative errors in their research database.

We omitted the administrative-error-only group from further analysis, and the results from the diagnostic-error-only group and the combined error group were considered together for subsequent comparisons.
Results by diagnostic group
The positive predictive value (PPV) for diagnosis in all studies is plotted by diagnostic group in the forest plots in Figs. 2 and 3, showing how the PPV varies by prevalence and diagnostic group amongst other variables. For those diagnostic categories with four or more results, the spread is also displayed in the box plot in Fig. 4a, where the range and quartile values of PPV results in the same diagnostic category can be seen. The spread of Cohen's kappa is shown for comparison in Fig. 4b.

The highest PPV was for the broad category of psychotic illness. Every study agreed that in a cohort with a diagnosis of psychotic illness recorded in secondary care, at least 80 % are likely to meet research criteria for this, and most suggested over 90 %. The diagnosis of schizophrenia showed a greater spread of PPV (40–100 %) than psychotic illness, but the majority of studies found the diagnosis at least 75 % predictive. Schizophrenia spectrum results lay in between those of broad psychosis and narrow schizophrenia. Other diagnoses with a median PPV around 75 % were affective disorders (with approximately the same spread as schizophrenia), unipolar depression and bipolar affective disorder (with a wider spread). Substance misuse disorders and anxiety disorders had a lower median PPV, while the diagnosis of schizoaffective disorder had a low PPV (<60 %) in all five studies that included it.

The variation of kappa within diagnostic categories is very large, with the range being lowest for affective disorders (0.3) and highest for schizophrenia (0.7). But between diagnostic groups the variation is small compared with PPV. The median kappas for schizophrenia and schizophrenia spectrum disorders are both around 0.5, as are those for diagnoses of depression and bipolar disorder.
Results by inpatient status
We divided studies into those done exclusively on inpatient data and those that included both inpatients and outpatients. Since around half of the studies in the inpatient group looked only at patients in their first presentation, which might be expected to have lower accuracy, we subdivided into three groups: inpatient only, first presentation only, and mixed in/outpatient. To compare them, we considered only the most common diagnostic categories: the diagnosis-specific results for schizophrenia or schizophrenia spectrum (schizophrenia used in preference where both were given); unipolar depression and bipolar disorder, or affective disorder (individual diagnoses used in preference where given); and overall agreement. There were 25 diagnoses considered in the mixed group with median PPV 72 % (interquartile range 44–87), 13 results in the inpatient group with median PPV 77 % (IQR 76–85), and 20 results from first presentation with median PPV 75 % (IQR 71–93). Comparison of kappa (median 0.50, 0.45 and 0.49 respectively) with Kruskal-Wallis pairwise testing found no significant difference between the inpatient and mixed groups, or between the first-presentation and mixed groups (p > 0.1).
