Journal Article · DOI

Validating the Interpretations and Uses of Test Scores

01 Mar 2013 · Journal of Educational Measurement (John Wiley & Sons, Ltd) · Vol. 50, Iss. 1, pp. 1-73
TL;DR: This article proposes an argument-based approach to validating an interpretation or use of test scores, in which the claims based on the test scores are outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses.
Abstract: To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. An argument-based approach to validation suggests that the claims based on the test scores be outlined as an argument that specifies the inferences and supporting assumptions needed to get from test responses to score-based interpretations and uses. Validation then can be thought of as an evaluation of the coherence and completeness of this interpretation/use argument and of the plausibility of its inferences and assumptions. In outlining the argument-based approach to validation, this paper makes eight general points. First, it is the proposed score interpretations and uses that are validated and not the test or the test scores. Second, the validity of a proposed interpretation or use depends on how well the evidence supports the claims being made. Third, more-ambitious claims require more support than less-ambitious claims. Fourth, more-ambitious claims (e.g., construct interpretations) tend to be more useful than less-ambitious claims, but they are also harder to validate. Fifth, interpretations and uses can change over time in response to new needs and new understandings, leading to changes in the evidence needed for validation. Sixth, the evaluation of score uses requires an evaluation of the consequences of the proposed uses; negative consequences can render a score use unacceptable. Seventh, the rejection of a score use does not necessarily invalidate a prior, underlying score interpretation. Eighth, the validation of the score interpretation on which a score use is based does not validate the score use.
Citations
01 Jan 2006
TL;DR: The Standards provide a framework that points to the effectiveness of high-quality instruments in those situations in which their use is supported by validation data.
Abstract: Educational and psychological testing and assessment are among the most important contributions of behavioral science to our society, providing fundamental and significant improvements over earlier practices. Although it cannot be claimed that all tests are sufficiently well developed, nor that all testing is prudent and useful, a large body of information points to the effectiveness of high-quality instruments in those situations in which their use is supported by validation data. Proper use of tests can lead to better decisions about individuals and programs than would be made without them, and can also point the way toward broader and fairer access to education and employment. Poor use of tests, however, can cause considerable harm to test takers and to other participants in the process of making decisions based on test data. The aim of the Standards is to promote sound and ethical test use and to establish a basis for evaluating the quality of testing practices. The purpose of publishing the Standards is to establish criteria for the evaluation of tests, testing practice, and the consequences of test use. Although the evaluation of the suitability of a test or its application should depend primarily on professional judgment, the Standards provide a framework that ensures all relevant issues are addressed. It would be desirable for all authors, sponsors, publishers, and users of professional tests to adopt the Standards and to encourage others to do likewise.

3,905 citations

Journal Article · DOI
TL;DR: Overall, the TAM explains technology acceptance well; yet, the role of certain key constructs and the importance of external variables contrast some existing beliefs about the TAM.
Abstract: The extent to which teachers adopt technology in their teaching practice has long been in the focus of research. Indeed, a plethora of models exist explaining influential factors and mechanisms of technology use in classrooms, one of which—the Technology Acceptance Model (TAM) and versions thereof—has dominated the field. Although consensus exists about which factors in the TAM might predict teachers’ technology adoption, the current field abounds in some controversies and inconsistent findings. This meta-analysis seeks to clarify some of these issues by combining meta-analysis with structural equation modeling approaches. Specifically, we synthesized 124 correlation matrices from 114 empirical TAM studies (N = 34,357 teachers) and tested the fit of the TAM and its versions. Overall, the TAM explains technology acceptance well; yet, the role of certain key constructs and the importance of external variables contrast some existing beliefs about the TAM. Implications for research and practice are discussed.
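
The two-stage method described above (pooling correlation matrices across studies, then fitting a structural model to the pooled matrix) can be sketched in miniature. Below is a minimal stage-one illustration in Python with made-up correlations among three TAM constructs; none of the numbers or construct choices come from the meta-analysis itself.

```python
import numpy as np

# Hypothetical stage-1 pooling for meta-analytic SEM (MASEM):
# average correlation matrices across studies, weighted by sample size.
# Constructs (illustrative only): PU = perceived usefulness,
# PEOU = perceived ease of use, BI = behavioral intention.

studies = [
    # (n, correlation matrix among [PU, PEOU, BI]) -- invented numbers
    (120, np.array([[1.00, 0.45, 0.55],
                    [0.45, 1.00, 0.40],
                    [0.55, 0.40, 1.00]])),
    (300, np.array([[1.00, 0.50, 0.60],
                    [0.50, 1.00, 0.35],
                    [0.60, 0.35, 1.00]])),
    (85,  np.array([[1.00, 0.38, 0.48],
                    [0.38, 1.00, 0.42],
                    [0.48, 0.42, 1.00]])),
]

total_n = sum(n for n, _ in studies)
pooled = sum(n * r for n, r in studies) / total_n

print("Pooled correlation matrix (stage 1):")
print(np.round(pooled, 3))
# Stage 2 (not shown) would fit the TAM's structural paths to this
# pooled matrix, e.g. PEOU -> PU -> BI, and assess model fit.
```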

676 citations


Cites background from "Validating the Interpretations and ..."

  • ...We believe that a thorough investigation of model fit—for both teacher samples and other subsamples—is a critical step towards creating a validity argument (Kane, 2013)....


Journal Article · DOI
01 Jan 2015
TL;DR: In this article, the state of research on the assessment of competencies in higher education is reviewed, and the resulting framework moves beyond dichotomies and shows how the different approaches complement each other.
Abstract: In this paper, the state of research on the assessment of competencies in higher education is reviewed. Fundamental conceptual and methodological issues are clarified by showing that current controversies are built on misleading dichotomies. By systematically sketching conceptual controversies, competing competence definitions are unpacked (analytic/trait vs. holistic/real-world performance) and commonplaces are identified. Disagreements are also highlighted. Similarly, competing statistical approaches to assessing competencies, namely item-response theory (latent trait) versus generalizability theory (sampling error variance), are unpacked. The resulting framework moves beyond dichotomies and shows how the different approaches complement each other. Competence is viewed along a continuum from traits that underlie perception, interpretation, and decision-making skills, which in turn give rise to observed behavior in real-world situations. Statistical approaches are also viewed along a continuum from linear to nonlinear models that serve different purposes. Item response theory (IRT) models may be used for scaling item responses and modeling structural relations, and generalizability theory (GT) models pinpoint sources of measurement error variance, thereby enabling the design of reliable measurements. The proposed framework suggests multiple new research studies and may serve as a "grand" structural model.
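
To make the IRT-versus-GT contrast concrete, here is a minimal sketch with assumed numbers (nothing here is taken from the paper): a Rasch model scales item responses nonlinearly in person ability and item difficulty, while a generalizability-style coefficient is a ratio of universe-score variance to universe-score-plus-error variance.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Rasch (1PL) model: probability of a correct response for a
    person with ability theta on an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Nonlinear IRT scaling: one ability level, items of varying difficulty.
for b in (-1.0, 0.0, 1.0):
    print(f"difficulty {b:+.1f}: P(correct) = {rasch_probability(0.5, b):.3f}")

# Generalizability-style view: a reliability-like coefficient from a
# hypothetical variance decomposition of observed scores.
var_person = 0.60   # universe-score (true) variance -- assumed
var_error = 0.25    # relative error variance -- assumed
g_coefficient = var_person / (var_person + var_error)
print(f"Generalizability coefficient: {g_coefficient:.3f}")
```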

660 citations


Cites background from "Validating the Interpretations and ..."


  • ...Moreover, whereas reliability and (construct) validity as classical criteria of test quality remain important, the range of quality criteria has been expanded to address specific characteristics of competence assessments such as authenticity, fairness, transparency, consequences for student achievement and motivation, and cost efficiency (Messick, 1995; Kane, 2013)....


Journal Article · DOI
TL;DR: Kane's framework addresses concerns of multiplicity of types of validity or failure to prioritise among sources of validity evidence by emphasising key inferences as the assessment progresses from a single observation to a final decision.
Abstract:
Context: Assessment is central to medical education and the validation of assessments is vital to their use. Earlier validity frameworks suffer from a multiplicity of types of validity or failure to prioritise among sources of validity evidence. Kane's framework addresses both concerns by emphasising key inferences as the assessment progresses from a single observation to a final decision. Evidence evaluating these inferences is planned and presented as a validity argument.
Objectives: We aim to offer a practical introduction to the key concepts of Kane's framework that educators will find accessible and applicable to a wide range of assessment tools and activities.
Results: All assessments are ultimately intended to facilitate a defensible decision about the person being assessed. Validation is the process of collecting and interpreting evidence to support that decision. Rigorous validation involves articulating the claims and assumptions associated with the proposed decision (the interpretation/use argument), empirically testing these assumptions, and organising evidence into a coherent validity argument. Kane identifies four inferences in the validity argument: Scoring (translating an observation into one or more scores); Generalisation (using the score[s] as a reflection of performance in a test setting); Extrapolation (using the score[s] as a reflection of real-world performance); and Implications (applying the score[s] to inform a decision or action). Evidence should be collected to support each of these inferences and should focus on the most questionable assumptions in the chain of inference. Key assumptions (and needed evidence) vary depending on the assessment's intended use or associated decision. Kane's framework applies to quantitative and qualitative assessments, and to individual tests and programmes of assessment.
Conclusions: Validation focuses on evaluating the key claims, assumptions and inferences that link assessment scores with their intended interpretations and uses. The Implications and associated decisions are the most important inferences in the validity argument.
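
One way to picture Kane's chain of inferences is as a small data structure linking each inference to its assumptions and evidence. The sketch below is our own illustration; the class, fields, and example entries are hypothetical and not drawn from Kane beyond the four inference names.

```python
from dataclasses import dataclass, field

@dataclass
class Inference:
    """One link in an interpretation/use argument (names are illustrative)."""
    name: str
    claim: str
    assumptions: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)

validity_argument = [
    Inference("Scoring", "Observed performance is translated into a score",
              ["Rubric applied consistently"], ["Rater agreement study"]),
    Inference("Generalisation", "Score reflects performance across the test setting",
              ["Tasks sampled adequately"], ["Reliability/G-study"]),
    Inference("Extrapolation", "Score reflects real-world performance",
              ["Test tasks resemble practice"], ["Criterion correlations"]),
    Inference("Implications", "Score appropriately informs the decision",
              ["Decision rule is defensible"], []),  # no evidence yet
]

# The weakest supported link bounds the whole argument: flag any
# inference that still lacks empirical evidence.
for inf in validity_argument:
    status = "supported" if inf.evidence else "NEEDS EVIDENCE"
    print(f"{inf.name}: {status}")
```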

355 citations

Journal Article · DOI
TL;DR: This article reviewed a representative sample of articles published in the Journal of Personality and Social Psychology for construct validity evidence and found that validity evidence of existing and author-developed scales was lacking, with coefficient α often being the only psychometric evidence reported.
Abstract: The verity of results about a psychological construct hinges on the validity of its measurement, making construct validation a fundamental methodology to the scientific process. We reviewed a representative sample of articles published in the Journal of Personality and Social Psychology for construct validity evidence. We report that latent variable measurement, in which responses to items are used to represent a construct, is pervasive in social and personality research. However, the field does not appear to be engaged in best practices for ongoing construct validation. We found that validity evidence of existing and author-developed scales was lacking, with coefficient α often being the only psychometric evidence reported. We provide a discussion of why the construct validation framework is important for social and personality researchers and recommendations for improving practice.
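
Because coefficient α is singled out above as often the only psychometric evidence reported, a minimal computation is worth making explicit. The sketch below implements the standard Cronbach's α formula on invented item-response data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's coefficient alpha.
    items: respondents x items matrix of scores."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Made-up responses: 6 respondents answering 4 Likert-type items.
responses = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
])
print(f"alpha = {cronbach_alpha(responses):.3f}")
# Internal consistency alone does not establish construct validity --
# the point the review above is making.
```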

335 citations


Cites background from "Validating the Interpretations and ..."


  • ...Further, construct validity pertains to a specific use of a scale (e.g., diagnosis or research) and can often be context or population dependent (Kane, 2013; Messick, 1995)....


References
Journal Article · DOI
TL;DR: This transmutability of the validation matrix argues for the comparisons within the heteromethod block as the most generally relevant validation data, and illustrates the potential interchangeability of trait and method components.
Abstract: [A multitrait correlation matrix covering Content, Memory (Learning Ability), Comprehension, and Vocabulary appeared here; its values were garbled in extraction.] As judged against these latter values, comprehension (.48) and vocabulary (.47), but not memory (.31), show some specific validity. This transmutability of the validation matrix argues for the comparisons within the heteromethod block as the most generally relevant validation data, and illustrates the potential interchangeability of trait and method components. Some of the correlations in Chi's (1937) prodigious study of halo effect in ratings are appropriate to a multitrait-multimethod matrix in which each rater might be regarded as representing a different method. While the published report does not make these available in detail because it employs averaged values, it is apparent from a comparison of his Tables IV and VIII that the ratings generally failed to meet the requirement that ratings of the same trait by different raters should correlate higher than ratings of different traits by the same rater. Validity is shown to the extent that, of the correlations in the heteromethod block, those in the validity diagonal are higher than the average heteromethod-heterotrait values. A conspicuously unsuccessful multitrait-multimethod matrix is provided by Campbell (1953, 1956) for rating of the leadership behavior of officers by themselves and by their subordinates. Only one of 11 variables (Recognition Behavior) met the requirement of providing a validity diagonal value higher than any of the heterotrait-heteromethod values, that validity being .29. For none of the variables were the validities higher than heterotrait-monomethod values. A study of attitudes toward authority and nonauthority figures by Burwen and Campbell (1957) contains a complex multitrait-multimethod matrix, one symmetrical excerpt from which is shown in Table 6. Method variance was strong for most of the procedures in this study. Where validity was found, it was primarily at the level of validity diagonal values higher than heterotrait-heteromethod values. As illustrated in Table 6, attitude toward father showed this kind of validity, as did attitude toward peers to a lesser degree. Attitude toward boss showed no validity. There was no evidence of a generalized attitude toward authority which would include father and boss, although such values as the […]
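
The comparison logic Campbell and Fiske describe (validity-diagonal correlations should exceed heterotrait values in the heteromethod block) is mechanical enough to express in code. The sketch below checks that criterion on an invented two-trait, two-method matrix; the numbers are not from the paper.

```python
import numpy as np

# Invented multitrait-multimethod (MTMM) matrix: traits T1, T2 each
# measured by methods M1, M2. Variable order: T1M1, T2M1, T1M2, T2M2.
R = np.array([
    [1.00, 0.30, 0.55, 0.20],   # T1M1
    [0.30, 1.00, 0.25, 0.50],   # T2M1
    [0.55, 0.25, 1.00, 0.35],   # T1M2
    [0.20, 0.50, 0.35, 1.00],   # T2M2
])

# Validity diagonal: same trait measured by different methods.
validity = {"T1": float(R[0, 2]), "T2": float(R[1, 3])}
# Heterotrait-heteromethod values: different trait AND different method.
hetero_hm = [float(R[0, 3]), float(R[1, 2])]

for trait, v in validity.items():
    ok = all(v > h for h in hetero_hm)
    print(f"{trait}: validity = {v:.2f} "
          f"{'exceeds' if ok else 'does NOT exceed'} "
          f"heterotrait-heteromethod values {hetero_hm}")
```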

15,795 citations

Book
01 Jan 2001
TL;DR: In this book, the authors present randomized and quasi-experimental designs for generalized causal inference in single and multiple studies, including designs with and without control groups and pretest observations on the outcome, together with a critical assessment of their assumptions.
Abstract:
1. Experiments and Generalized Causal Inference
2. Statistical Conclusion Validity and Internal Validity
3. Construct Validity and External Validity
4. Quasi-Experimental Designs That Either Lack a Control Group or Lack Pretest Observations on the Outcome
5. Quasi-Experimental Designs That Use Both Control Groups and Pretests
6. Quasi-Experimentation: Interrupted Time Series Designs
7. Regression Discontinuity Designs
8. Randomized Experiments: Rationale, Designs, and Conditions Conducive to Doing Them
9. Practical Problems 1: Ethics, Participant Recruitment, and Random Assignment
10. Practical Problems 2: Treatment Implementation and Attrition
11. Generalized Causal Inference: A Grounded Theory
12. Generalized Causal Inference: Methods for Single Studies
13. Generalized Causal Inference: Methods for Multiple Studies
14. A Critical Assessment of Our Assumptions

12,215 citations

Journal Article · DOI
TL;DR: The present interpretation of construct validity is not "official" and deals with some areas where the Committee would probably not be unanimous, but the present writers are solely responsible for this attempt to explain the concept and elaborate its implications.
Abstract: Validation of psychological tests has not yet been adequately conceptualized, as the APA Committee on Psychological Tests learned when it undertook (1950-54) to specify what qualities should be investigated before a test is published. In order to make coherent recommendations the Committee found it necessary to distinguish four types of validity, established by different types of research and requiring different interpretation. The chief innovation in the Committee's report was the term construct validity.[2] This idea was first formulated by a subcommittee (Meehl and R. C. Challman) studying how proposed recommendations would apply to projective techniques, and later modified and clarified by the entire Committee (Bordin, Challman, Conrad, Humphreys, Super, and the present writers). The statements agreed upon by the Committee (and by committees of two other associations) were published in the Technical Recommendations (59). The present interpretation of construct validity is not "official" and deals with some areas where the Committee would probably not be unanimous. The present writers are solely responsible for this attempt to explain the concept and elaborate its implications.

9,935 citations

Book
01 Jan 1968
TL;DR: In this book, the authors survey statistical test-theory models for mental test scores; the book is organized around test-score theories and models, with substantial emphasis on the practical applications and limitations of each model studied.
Abstract: This is a reprint of the original book released in 1968. Our primary goal in this book is to sharpen the skill, sophistication, and intuition of the reader in the interpretation of mental test data, and in the construction and use of mental tests both as instruments of psychological theory and as tools in the practical problems of selection, evaluation, and guidance. We seek to do this by exposing the reader to some psychologically meaningful statistical theories of mental test scores. Although this book is organized in terms of test-score theories and models, the practical applications and limitations of each model studied receive substantial emphasis, and these discussions are presented in as nontechnical a manner as we have found possible. Since this book catalogues a host of test theory models and formulas, it may serve as a reference handbook. Also, for a limited group of specialists, this book aims to provide a more rigorous foundation for further theoretical research than has heretofore been available. One aim of this book is to present statements of the assumptions, together with derivations of the implications, of a selected group of statistical models that the authors believe to be useful as guides in the practices of test construction and utilization. With few exceptions we have given a complete proof for each major result presented in the book. In many cases these proofs are simpler, more complete, and more illuminating than those originally offered. When we have omitted proofs or parts of proofs, we have generally provided a reference containing the omitted argument. We have left some proofs as exercises for the reader, but only when the general method of proof has already been demonstrated. At times we have proved only special cases of more generally stated theorems, when the general proof affords no additional insight into the problem and yet is substantially more complex mathematically.
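
The classical true-score model underlying Lord and Novick's treatment can be simulated in a few lines. The sketch below uses assumed variances (not taken from the book) to generate observed scores as true score plus error and to recover reliability as the ratio of true-score to observed-score variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Classical test theory: observed score X = true score T + error E,
# with T and E uncorrelated. The variances below are assumed.
n_examinees = 10_000
true_scores = rng.normal(50, 8, n_examinees)    # var(T) = 64
errors = rng.normal(0, 4, n_examinees)          # var(E) = 16
observed = true_scores + errors

# Reliability = var(T) / var(X); theoretical value 64 / 80 = 0.80.
reliability = true_scores.var(ddof=1) / observed.var(ddof=1)
print(f"estimated reliability = {reliability:.3f}")
```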

6,814 citations