Journal Article

Benchmarking Physician Performance: Reliability of Individual and Composite Measures

TL;DR: In typical health plan administrative data, most physicians do not have adequate numbers of quality events to support reliable quality measurement, and the reliability of quality measures should be taken into account when quality information is used for public reporting and accountability.
Abstract: Measuring physician performance is becoming commonplace as health plans and purchasers look for ways to drive quality improvement and to increase physicians' accountability and rewards for achieving quality goals. A recent study [1] reported that, among 89% of health maintenance organization plans using physician-oriented pay-for-performance programs, more than one-third measured and rewarded quality at the individual physician level. In addition, public and private purchasers are demanding more information about America's physicians and hospitals to aid in value-based purchasing and selection of health plans and providers [2]. However, concerns remain regarding the validity and reliability of such physician performance profiles. Several factors are needed to support fair and accurate comparisons among physicians. These include evidence-based quality measures, complete and accurate data sources, and standardized methods of data collection. Physician-level reliability of a quality measure is another key consideration in this measurement. Physician-level reliability refers to the ability of a quality measure to distinguish an individual physician's performance from the performance of physicians overall. Good physician-level reliability requires the following 2 factors: (1) a sufficient number of patients eligible for a given quality measure and (2) performance variation across physicians on that quality measure [3-5]. The greater the number of a physician's patients who are eligible for a quality measure, the more precise the estimate of the physician's performance. When performance variation for a given quality measure across physicians is limited, the likelihood that a physician's performance is statistically significantly different from that of his or her peers is also decreased. Hofer and colleagues [6] showed that not controlling for a quality measure's physician-level reliability significantly misrepresented performance differences across physicians.
However, adjusting performance profiles in such a manner is not commonplace across the healthcare industry. Ensuring that measurement results are valid and reliable is important when purchasers and plans (and potentially consumers) use the data to make decisions about which physicians get financial rewards or other benefits. The stakes are particularly high when profiling results are used for public reporting or eligibility for participation in a health plan network. Paying attention to the validity and reliability of data will help to ensure that these decisions are based on real differences in performance among physicians rather than any shortcomings of the measurement. Although performance results based on limited sample sizes could be adjusted for the reliability of individual measures [7-9], the creation of composite scores may also be a useful way to increase the reliability of physicians' performance scores [10]. Little is known about the extent to which constructing composite scores mitigates the limitations of sample size and reliability, while continuing to provide useful and understandable information [11]. To date, there have been few reports regarding the reliability of physician-level performance scores associated with commonly used practices and methods in the healthcare industry. To begin to address this deficiency, this study relied on a large data set that combined patient-level administrative data from 9 large health plans to compute performance for primary care physicians (PCPs) using 27 commonly measured quality indicators. This data set is typical of data sources often used by individual health plans to profile physician performance. Specifically, we examined for each quality measure and composite score the proportion of PCPs who could be evaluated given different minimum sample size criteria and the physician-level reliability under those minimum sample size criteria.
Our primary research questions were the following: (1) What is the physician-level reliability of commonly used performance measures calculated exclusively based on administrative data? (2) Can more physicians be reliably evaluated using a composite score?
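The two drivers of physician-level reliability named in the abstract (eligible-patient count and between-physician variation) can be illustrated with a small sketch. It uses a common signal-to-noise formulation for a pass/fail measure, reliability = between-physician variance / (between-physician variance + sampling variance). This is an assumed formulation for illustration, not necessarily the exact estimator used in the study, and the function name and all numeric inputs are hypothetical.

```python
def binary_measure_reliability(n_patients, pass_rate, between_physician_var):
    """Signal-to-noise reliability of one physician's score on a pass/fail measure.

    Assumed formulation (not taken from the paper):
    signal = variance of true pass rates across physicians,
    noise  = binomial sampling variance of this physician's observed rate.
    """
    # Sampling variance of an observed proportion based on n eligible patients.
    noise_var = pass_rate * (1.0 - pass_rate) / n_patients
    return between_physician_var / (between_physician_var + noise_var)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the direction of the effects.
    for n in (10, 30, 100):
        for var in (0.002, 0.01):
            r = binary_measure_reliability(n, pass_rate=0.70, between_physician_var=var)
            print(f"n={n:3d}  between-physician var={var:.3f}  reliability={r:.2f}")
```

Under these assumptions, reliability rises as the number of eligible patients grows and falls as physicians become more alike on the measure, which matches the two factors listed above.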
Citations
Book
05 Jun 2013
TL;DR: The knowledge and tools exist to put the health system on the right course to achieve continuous improvement and better quality care at a lower cost, and a better use of data is a critical element of a continuously improving health system.
Abstract: America's health care system has become too complex and costly to continue business as usual. Best Care at Lower Cost explains that inefficiencies, an overwhelming amount of data, and other economic and quality barriers hinder progress in improving health and threaten the nation's economic stability and global competitiveness. According to this report, the knowledge and tools exist to put the health system on the right course to achieve continuous improvement and better quality care at a lower cost. The costs of the system's current inefficiency underscore the urgent need for a systemwide transformation. About 30 percent of health spending in 2009--roughly $750 billion--was wasted on unnecessary services, excessive administrative costs, fraud, and other problems. Moreover, inefficiencies cause needless suffering. By one estimate, roughly 75,000 deaths might have been averted in 2005 if every state had delivered care at the quality level of the best performing state. This report states that the way health care providers currently train, practice, and learn new information cannot keep pace with the flood of research discoveries and technological advances. About 75 million Americans have more than one chronic condition, requiring coordination among multiple specialists and therapies, which can increase the potential for miscommunication, misdiagnosis, potentially conflicting interventions, and dangerous drug interactions. Best Care at Lower Cost emphasizes that a better use of data is a critical element of a continuously improving health system, such as mobile technologies and electronic health records that offer significant potential to capture and share health data better. In order for this to occur, the National Coordinator for Health Information Technology, IT developers, and standard-setting organizations should ensure that these systems are robust and interoperable.
Clinicians and care organizations should fully adopt these technologies, and patients should be encouraged to use tools, such as personal health information portals, to actively engage in their care. This book is a call to action that will guide health care providers; administrators; caregivers; policy makers; health professionals; federal, state, and local government agencies; private and public health organizations; and educational institutions.

1,324 citations

Journal ArticleDOI
TL;DR: The reliability results were used to estimate the proportion of physicians in each specialty whose cost performance would be classified inaccurately in a two-tiered insurance product in which the physicians with cost profiles in the lowest quartile were labeled as "lower cost."
Abstract: BACKGROUND Insurance products with incentives for patients to choose physicians classified as offering lower-cost care on the basis of cost-profiling tools are increasingly common. However, no rigorous evaluation has been undertaken to determine whether these tools can accurately distinguish higher-cost physicians from lower-cost physicians. METHODS We aggregated claims data for the years 2004 and 2005 from four health plans in Massachusetts. We used commercial software to construct clinically homogeneous episodes of care (e.g., treatment of diabetes, heart attack, or urinary tract infection), assigned each episode to a physician, and created a summary profile of resource use (i.e., cost) for each physician on the basis of all assigned episodes. We estimated the reliability (signal-to-noise ratio) of each physician’s cost-profile score on a scale of 0 to 1, with 0 indicating that all differences in physicians’ cost profiles are due to a lack of precision in the measure (noise) and 1 indicating that all differences are due to real variation in costs of services (signal). We used the reliability results to estimate the proportion of physicians in each specialty whose cost performance would be classified inaccurately in a two-tiered insurance product in which the physicians with cost profiles in the lowest quartile were labeled as “lower cost.” RESULTS Median reliabilities ranged from 0.05 for vascular surgery to 0.79 for gastroenterology and otolaryngology. Overall, 59% of physicians had cost-profile scores with reliabilities of less than 0.70, a commonly used marker of suboptimal reliability. Using our reliability results, we estimated that 22% of physicians would be misclassified in a two-tiered system. CONCLUSIONS Current methods for profiling physicians with respect to costs of services may produce misleading results.

159 citations

Journal ArticleDOI
TL;DR: The analysis reveals that designing a fair and effective program is a complex undertaking, and the design of P4P programs should be tailored to the specific setting of implementation, and empirical research is needed to confirm the conclusions.
Abstract: Pay for performance (P4P) is increasingly being used to stimulate healthcare providers to improve their performance. However, evidence on P4P effectiveness remains inconclusive. Flaws in program design may have contributed to this limited success. Based on a synthesis of relevant theoretical and empirical literature, this paper discusses key issues in P4P-program design. The analysis reveals that designing a fair and effective program is a complex undertaking. The following tentative conclusions are made: (1) performance is ideally defined broadly, provided that the set of measures remains comprehensible, (2) concerns that P4P encourages “selection” and “teaching to the test” should not be dismissed, (3) sophisticated risk adjustment is important, especially in outcome and resource use measures, (4) involving providers in program design is vital, (5) on balance, group incentives are preferred over individual incentives, (6) whether to use rewards or penalties is context-dependent, (7) payouts should be frequent and low-powered, (8) absolute targets are generally preferred over relative targets, (9) multiple targets are preferred over single targets, and (10) P4P should be a permanent component of provider compensation and is ideally “decoupled” from base payments. However, the design of P4P programs should be tailored to the specific setting of implementation, and empirical research is needed to confirm the conclusions.

147 citations


Cites background from "Benchmarking Physician Performance:..."

  • ...Patient panels of individual physicians and small groups are typically too small to measure performance reliably [2, 54, 56, 63, 76, 90]....

    [...]

  • ...quality than individual measures and do not guarantee reliability levels sufficient to enable inclusion of large shares of providers [90]....

    [...]

  • ...In that case, replacing and/or updating measures are warranted, also because variation in performance may have become too small to measure performance reliably and to discriminate across providers [63, 90]....

    [...]

Journal ArticleDOI
TL;DR: Although the programs share many similarities, they differ in several important respects, also when compared with the typical P4P program in the United States. There are clearly possibilities to increase incentive strength and minimize incentives for undesired behavior.
Abstract: Pay for performance (P4P) has become a popular approach to performance improvement in health care. Most of the P4P literature has focused on the United States and there is limited insight into the characteristics of major programs initiated in other countries. This article systematically describes and reviews P4P programs outside the United States. Our literature search identified 13 programs initiated in 9 countries. Although the programs share many similarities, they differ in several important respects, also when compared with the typical P4P program in the United States. In addition, there are clearly possibilities to increase incentive strength and minimize incentives for undesired behavior. In part, observed heterogeneity will be a consequence of contextual differences, but design choices often also seem to be made arbitrarily. In designing their programs, purchasers are hampered by limited knowledge of the influence of specific design choices and effective strategies to mitigate undesired behavior.

145 citations


Cites background from "Benchmarking Physician Performance:..."

  • ...Yet for many measures included in these programs, sample size may well be insufficient to generate reliable profiles, especially for outcomes and resource use (Hofer et al., 1999; Krein et al., 2002; Mehrotra, Adams, Thomas, & McGlynn, 2010; Scholle et al., 2008)....

    [...]

BookDOI
28 Apr 2015
TL;DR: A streamlined set of 15 standardized measures could provide consistent benchmarks for health progress across the nation and improve system performance in the highest-priority areas.
Abstract: Thousands of measures are in use today to assess health and health care in the United States. Although many of these measures provide useful information, their usefulness in either gauging or guiding performance improvement in health and health care is seriously limited by their sheer number, as well as their lack of consistency, compatibility, reliability, focus, and organization. To achieve better health at lower cost, all stakeholders (including health professionals, payers, policy makers, and members of the public) must be alert to what matters most. What are the core measures that will yield the clearest understanding and focus on better health and well-being for Americans? Vital Signs explores the most important issues (healthier people, better quality care, affordable care, and engaged individuals and communities) and specifies a streamlined set of 15 core measures. These measures, if standardized and applied at national, state, local, and institutional levels across the country, will transform the effectiveness, efficiency, and burden of health measurement and help accelerate focus and progress on our highest health priorities. Vital Signs also describes the leadership and activities necessary to refine, apply, maintain, and revise the measures over time, as well as how they can improve the focus and utility of measures outside the core set. If health care is to become more effective and more efficient, sharper attention is required on the elements most important to health and health care. Vital Signs lays the groundwork for the adoption of core measures that, if systematically applied, will yield better health at a lower cost for all Americans.

141 citations

References
Journal ArticleDOI
TL;DR: In this article, the authors present guidelines for choosing among six different forms of the intraclass correlation for reliability studies in which n targets are rated by k judges, and the confidence intervals for each of the forms are reviewed.
Abstract: Reliability coefficients often take the form of intraclass correlation coefficients. In this article, guidelines are given for choosing among six different forms of the intraclass correlation for reliability studies in which n targets are rated by k judges. Relevant to the choice of the coefficient are the appropriate statistical model for the reliability and the application to be made of the reliability results. Confidence intervals for each of the forms are reviewed.
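As an illustration of one of these forms, the sketch below computes the one-way random-effects intraclass correlation for a single rating, commonly written ICC(1,1), from the between-target and within-target mean squares: ICC = (BMS - WMS) / (BMS + (k - 1) * WMS). The function name and the ratings matrix are hypothetical, and the sketch assumes every target receives the same number of ratings.

```python
def icc_one_way(ratings):
    """ICC(1,1): one-way random-effects intraclass correlation, single rating.

    ratings: list of rows, one row per target, each row holding k ratings.
    Assumes a balanced design (every target has exactly k ratings).
    """
    n = len(ratings)       # number of targets
    k = len(ratings[0])    # ratings per target
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-targets mean square (n - 1 degrees of freedom).
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-targets mean square (n * (k - 1) degrees of freedom).
    wms = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (bms - wms) / (bms + (k - 1) * wms)


if __name__ == "__main__":
    # Illustrative ratings: 6 targets, each rated 4 times.
    example = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    print(round(icc_one_way(example), 3))
```

In a physician-profiling setting the targets would be physicians and the ratings their patients' scores, so a low coefficient means that most of the observed variation lies within physicians rather than between them.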

21,185 citations

Journal ArticleDOI
09 Jun 1999-JAMA
TL;DR: Use of individual physician profiles may foster an environment in which physicians can most easily avoid being penalized by avoiding or deselecting patients with high prior cost, poor adherence, or response to treatments.
Abstract: Context: Physician profiling is widely used by many health care systems, but little is known about the reliability of commonly used profiling systems. Objectives: To determine the reliability of a set of physician performance measures for diabetes care, one of the most common conditions in medical practice, and to examine whether physicians could substantially improve their profiles by preferential patient selection. Design and Setting: Cohort study performed from 1990 to 1993 at 3 geographically and organizationally diverse sites, including a large staff-model health maintenance organization, an urban university teaching clinic, and a group of private-practice physicians in an urban area. Participants: A total of 3642 patients with type 2 diabetes cared for by 232 different physicians. Main Outcome Measures: Physician profiles for their patients' hospitalization and clinic visit rates, total laboratory resource utilization rate, and level of glycemic control by average hemoglobin A1c level with and without detailed case-mix adjustment. Results: For profiles based on hospitalization rates, visit rates, laboratory utilization rates, and glycemic control, 4% or less of the overall variance was attributable to differences in physician practice and the reliability of the median physician's case-mix–adjusted profile was never better than 0.40. At this low level of physician effect, a physician would need to have more than 100 patients with diabetes in a panel for profiles to have a reliability of 0.80 or better (while more than 90% of all primary care physicians at the health maintenance organization had fewer than 60 patients with diabetes). For profiles of glycemic control, high outlier physicians could dramatically improve their physician profile simply by pruning from their panel the 1 to 3 patients with the highest hemoglobin A1c levels during the prior year.
This advantage from gaming could not be prevented by even detailed case-mix adjustment. Conclusions: Physician "report cards" for diabetes, one of the highest-prevalence conditions in medical practice, were unable to detect reliably true practice differences within the 3 sites studied. Use of individual physician profiles may foster an environment in which physicians can most easily avoid being penalized by avoiding or deselecting patients with high prior cost, poor adherence, or response to treatments.
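The link between "4% or less of the overall variance" and "more than 100 patients" for a reliability of 0.80 can be checked with a short sketch. It assumes the standard Spearman-Brown relationship between panel size n, the physician-level share of variance (ICC), and profile reliability, R = n * ICC / (1 + (n - 1) * ICC). This is one standard way to relate these quantities and may not be the authors' exact computation; the function names are hypothetical.

```python
import math


def profile_reliability(n_patients, icc):
    """Reliability of a physician's mean score over n patients (Spearman-Brown)."""
    return n_patients * icc / (1.0 + (n_patients - 1) * icc)


def patients_needed(target_reliability, icc):
    """Smallest panel size whose profile reaches the target reliability."""
    n = target_reliability * (1.0 - icc) / (icc * (1.0 - target_reliability))
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(n, 9))


if __name__ == "__main__":
    # 4% is the upper bound on physician-level variance quoted in the abstract;
    # the smaller values are hypothetical.
    for icc in (0.04, 0.03, 0.02):
        print(f"ICC={icc:.2f}  patients needed for 0.80: {patients_needed(0.80, icc)}")
```

Under this assumption, an ICC of exactly 0.04 requires about 96 patients, and any smaller physician effect pushes the requirement past 100, which is consistent with the figure quoted in the abstract.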

496 citations

Journal ArticleDOI
03 Sep 2003-JAMA
TL;DR: It is concluded that important technical barriers stand in the way of using physician clinical performance assessment for evaluating the competency of individual physicians and that overcoming these barriers will require considerable additional research and development.
Abstract: The performance of physicians in their day-to-day clinical practices has become an area of intense public interest. Both patients and health care purchasers want more effective means of identifying excellent clinicians, and a variety of organizations are discussing and implementing plans for assessing the performance of individual clinicians. In this article, we review the current state of physician clinical performance assessment with a focus on its usefulness for competency assessment. We describe recommendations for a physician clinical performance assessment system for these purposes, and identify ways in which current methods of performance assessment fall short of these. We conclude that important technical barriers stand in the way of using physician clinical performance assessment for evaluating the competency of individual physicians. Overcoming these barriers will require considerable additional research and development. Even then, for some uses, physician clinical performance assessment at the individual physician level may be technically impossible to accomplish in a valid and fair way.

250 citations

Journal ArticleDOI
TL;DR: The effects of patient characteristics and physician-level clustering on quality assessment results among patients with diabetes are illustrated, and Provider Recognition Program data are used to compare two groups of physicians, in this case generalists and specialists, whose results were expected to differ.
Abstract: The findings of this study underscore the importance of designing physician profiling studies with sufficient power to account for physician-level variation (clustering) as well as patient case-mix...

231 citations

Journal ArticleDOI
TL;DR: The Massachusetts Ambulatory Care Experiences Survey Project was a statewide demonstration project designed to test the feasibility and value of measuring patients’ experiences with individual primary care physicians and their practices and underscores the validity and importance of looking beyond health plans to individual physicians and sites as the authors seek to improve health care quality.
Abstract: BACKGROUND: Measuring and reporting patients’ experiences with health plans has been routine for several years. There is now substantial interest in measuring patients’ experiences with individual physicians, but numerous concerns remain.

230 citations


"Benchmarking Physician Performance:..." refers background in this paper

  • ...Good physician-level reliability requires the following 2 factors: (1) a sufficient number of patients eligible for a given quality measure and (2) performance variation across physicians on that quality measure....

    [...]

  • ...A quality measure can have good reliability (1) because there is comparatively high physician-to-physician variance or (2) because there is not much “noise” or measurement error in the estimate of the individual physician performance, usually as a result of large sample sizes....

    [...]

  • ...Our primary research questions were the following: (1) What is the physician-level reliability of commonly used performance measures calculated exclusively based on administrative data? (2) Can more physicians be reliably evaluated using a composite score?...

    [...]