Journal ArticleDOI

A comparison of reliability and precision of subscore reporting methods for a state English language proficiency assessment

01 Apr 2018 - Language Testing (SAGE Publications, London, England) - Vol. 35, Iss. 2, pp. 297-317
Abstract: K–12 English language proficiency tests that assess multiple content domains (e.g., listening, speaking, reading, writing) often have subsections based on these content domains; scores assigned to ...
Citations
01 Jan 2016
Assessing Young Language Learners

113 citations

Journal Article

42 citations

Journal ArticleDOI
TL;DR: This paper conducted a systematic review of the item response theory literature in language assessment to investigate the conceptualization and operationalization of the dimensionality of language ability, and found that exploratory factor analysis was the primary method of dimensionality analysis in papers that had applied unidimensional IRT models, whereas the comparison modeling approach was dominant in the multidimensional framework.

12 citations

Journal ArticleDOI
TL;DR: This article argued that reporting a subscore is not always justified; a subscore should provide reliable and distinct information to be worth reporting.
Abstract: Stakeholders of language tests are often interested in subscores. However, reporting a subscore is not always justified; a subscore should provide reliable and distinct information to be worth repo...

8 citations


Cites background or result from "A comparison of reliability and pre..."

  • ...However, the language testing literature on fine-grained feedback largely lacks an explicit discussion about whether and when such information is psychometrically justified, with notable exceptions of Longabach and Peyton (2018), Papageorgiou and Choi (2018), and Sawaki and Sinharay (2013)....

    [...]

  • ...Longabach and Peyton (2018) examined subscores from a K–12 English language proficiency test, and found that, compared to other augmentation methods, augmentation using MIRT (Yao & Boughton, 2007; Haberman & Sinharay, 2010) improved the subscore reliability the most....

    [...]

  • ...Both Longabach and Peyton (2018) and Papageorgiou and Choi (2018) focused on subscores for individual test takers and did not evaluate group-level subscores....

    [...]
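The augmentation result quoted above is typically judged against the value-added criterion associated with Haberman and Sinharay (2010). As a minimal sketch of that criterion (notation mine, not quoted from the article): the true subscore s_t is approximated by a linear function of either the observed subscore s or the total score x, and the subscore is worth reporting only when it is the better predictor, i.e., when its proportional reduction in mean squared error (PRMSE) is larger:

    \mathrm{PRMSE}_s = \rho^2(s, s_t) \ (\text{the subscore's reliability}), \qquad \mathrm{PRMSE}_x = \rho^2(x, s_t)

    \text{report the subscore only if } \mathrm{PRMSE}_s > \mathrm{PRMSE}_x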

Journal ArticleDOI
15 Nov 2019
TL;DR: In this article, the validity and reliability of the English language proficiency assessments used for English language learners in early-exit TBE programs in elementary schools were investigated.
Abstract: English language learners (ELLs) have become one of the fastest-growing groups in elementary schools in the United States of America. This paper examines the English proficiency assessments currently used for ELLs in early-exit TBE programs in elementary schools, and the validity and reliability of these English language proficiency assessments.

5 citations


Cites background or methods from "A comparison of reliability and pre..."

  • ...According to Longabach and Peyton (2017), it is of great importance to assess the sub-score reliability and total score reliability when assessing the English proficiency of ELLs because the ultimate goal of assessment is to improve the education for ELLs, which relies on the accuracy of the…...

    [...]

  • ...Longabach and Peyton (2017) used Cronbach's alpha and standard error of measurement respectively to estimate the reliability and precision of the four methods....

    [...]

  • ...According to Longabach and Peyton (2017), MIRT was found to be the most reliable one among the four methods to score the sub-domains for all grade levels....

    [...]
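For reference, the two quantities named in these excerpts can be written out explicitly; this is the standard textbook formulation, assumed here rather than quoted from either paper. For a subscale of k items with item score variances \sigma_i^2 and observed subscale score variance \sigma_X^2:

    \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right), \qquad \mathrm{SEM} = \sigma_X \sqrt{1 - \alpha}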

References
Book
08 Sep 2003
TL;DR: A language assessment textbook; the opening chapter addresses testing, assessing, and teaching, and asks "What is a test?"
Abstract: (Table of contents.) Chapter 1: Assessment Concepts and Issues. Chapter 2: Principles of Language Assessment (practicality, reliability, validity, authenticity, washback). Chapter 3: Designing Classroom Language Tests. Chapter 4: Standards-Based Assessment. Chapter 5: Standardized Testing. Chapter 6: Beyond Tests: Alternatives in Assessment (performance-based assessment, rubrics, portfolios, journals, observations, self- and peer-assessment). Chapter 7: Assessing Listening. Chapter 8: Assessing Speaking. Chapter 9: Assessing Reading. Chapter 10: Assessing Writing. Chapter 11: Assessing Grammar and Vocabulary. Chapter 12: Grading and Student Evaluation. Appendix: Commercial Tests. Glossary. Bibliography. Indexes.

2,731 citations


"A comparison of reliability and pre..." refers background in this paper


  • ...Although a number of studies criticize language assessment tasks for all domains on various grounds related to the inappropriate representation of the construct in question, some authors (e.g., Brindley & Slatyer, 2002; Brown, 2004; Khoii & Paydarnia, 2001; Shin, 2007) specifically draw attention to the fact that listening skills, while critical to ELD, are not very well understood and hard to assess....

    [...]

Journal ArticleDOI
TL;DR: The mirt package was created for estimating multidimensional item response theory parameters for exploratory and confirmatory models by using maximum-likelihood methods.
Abstract: Item response theory (IRT) is widely used in assessment and evaluation research to explain how participants respond to item-level stimuli. Several R packages can be used to estimate the parameters in various IRT models, the most flexible being the ltm (Rizopoulos 2006), eRm (Mair and Hatzinger 2007), and MCMCpack (Martin, Quinn, and Park 2011) packages. However, these packages have limitations: ltm and eRm can only analyze unidimensional IRT models effectively, and the exploratory multidimensional extensions available in MCMCpack require prior understanding of Bayesian estimation convergence diagnostics and are computationally intensive. Most importantly, multidimensional confirmatory item factor analysis methods have not been implemented in any R package. The mirt package was created for estimating multidimensional item response theory parameters for exploratory and confirmatory models by using maximum-likelihood methods. The Gauss-Hermite quadrature method used in traditional EM estimation (e.g., Bock and Aitkin 1981) is presented for exploratory item response models as well as for confirmatory bifactor models (Gibbons and Hedeker 1992). Exploratory and confirmatory models are estimated by a stochastic algorithm described by Cai (2010a,b). Various program comparisons are presented and future directions for the package are discussed.
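To make the confirmatory usage concrete, here is a minimal R sketch of a two-factor confirmatory MIRT fit with mirt, in the spirit of the subscore setting of the main article. The factor layout, the simulated data, and all parameter values are illustrative assumptions, not taken from either paper.

    library(mirt)

    # Simulate responses to 10 dichotomous items: items 1-5 load on one
    # factor, items 6-10 on a second, with correlated traits (rho = 0.5).
    set.seed(1)
    a <- matrix(0, nrow = 10, ncol = 2)
    a[1:5, 1]  <- rlnorm(5, 0.2, 0.2)   # slopes on factor 1
    a[6:10, 2] <- rlnorm(5, 0.2, 0.2)   # slopes on factor 2
    d <- matrix(rnorm(10))              # item intercepts
    dat <- simdata(a, d, N = 1000, itemtype = "dich",
                   sigma = matrix(c(1, 0.5, 0.5, 1), 2))

    # Confirmatory two-factor model with an estimated factor covariance.
    spec <- mirt.model("
      F1 = 1-5
      F2 = 6-10
      COV = F1*F2")

    fit <- mirt(dat, spec, itemtype = "2PL", method = "EM")
    head(fscores(fit))   # EAP estimates of both latent subscores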

1,420 citations


"A comparison of reliability and pre..." refers methods in this paper

  • ...The present research was carried out using only the R package mirt (Chalmers, 2012; code available upon request), which not only simplified calculations and parameter estimations, but also made them more consistent from one method of subscore reporting to another....

    [...]

Journal ArticleDOI
TL;DR: The generalized partial credit model (GPCM) extends the partial credit model with a varying slope parameter; the item step parameter is decomposed into a location and a threshold parameter, following Andrich's (1978) rating scale formulation.
Abstract: The partial credit model (PCM) with a varying slope parameter is developed and called the generalized partial credit model (GPCM). The item step parameter of this model is decomposed to a location and a threshold parameter, following Andrich's (1978) rating scale formulation. The EM algorithm for estimating the model parameters is derived. The performance of this generalized model is compared on both simulated and real data to a Rasch family of polytomous item response models. Simulated data were generated and then analyzed by the various polytomous item response models. The results demonstrate that the rating formulation of the GPCM is quite adaptable to the analysis of polytomous item responses. The real data used in this study consisted of the National Assessment of Educational Progress (Johnson & Allen, 1992) mathematics data that used both dichotomous and polytomous items. The PCM was applied to these data using both constant and varying slope parameters. The GPCM, which provides for varying slope pa...
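As a reference point (standard notation, assumed rather than quoted from the abstract), the GPCM gives the probability that examinee i with trait \theta_i responds in category k of item j, which has m_j + 1 ordered categories, as

    P_{jk}(\theta_i) = \frac{\exp \sum_{v=0}^{k} a_j (\theta_i - b_{jv})}{\sum_{c=0}^{m_j} \exp \sum_{v=0}^{c} a_j (\theta_i - b_{jv})}, \qquad b_{jv} = b_j - d_v

where a_j is the varying slope parameter and the item step parameter b_{jv} is decomposed into the location b_j and threshold d_v mentioned above (with the usual convention that the v = 0 term of each sum is zero).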

1,219 citations

Book ChapterDOI
TL;DR: In this paper, the authors describe the commonly used multidimensional item response theory (MIRT) models and the important methods needed for their practical application, including ways to determine the number of dimensions required to adequately model data, procedures for estimating model parameters, ways to define the space for a MIRT model, and procedures for transforming calibrations from different samples to put them in the same space.
Abstract: Multidimensional Item Response Theory is the first book to give thorough coverage to this emerging area of psychometrics. The book describes the commonly used multidimensional item response theory (MIRT) models and the important methods needed for their practical application. These methods include ways to determine the number of dimensions required to adequately model data, procedures for estimating model parameters, ways to define the space for a MIRT model, and procedures for transforming calibrations from different samples to put them in the same space. A full chapter is devoted to methods for multidimensional computerized adaptive testing. The text is appropriate for an advanced course in psychometric theory or as a reference work for those interested in applying MIRT methodology. A working knowledge of unidimensional item response theory and matrix algebra is assumed. Knowledge of factor analysis is also helpful.
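For orientation (standard compensatory-MIRT notation, assumed rather than quoted from the book), the multidimensional two-parameter logistic model at the core of this framework is

    P(x_{ij} = 1 \mid \boldsymbol{\theta}_i) = \frac{1}{1 + \exp(-(\mathbf{a}_j^{\top} \boldsymbol{\theta}_i + d_j))}

where \boldsymbol{\theta}_i is the examinee's vector of latent traits (one per dimension), \mathbf{a}_j the item's vector of discrimination parameters, and d_j a scalar intercept (Reckase, 2009).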

868 citations


"A comparison of reliability and pre..." refers methods in this paper

  • ...Item parameters are calculated from the factor loadings of each item on each subscale and the estimated covariance between subscores’ matrices, and then the person parameter is calculated based on these values (Reckase, 2009)....

    [...]

Journal ArticleDOI

682 citations


"A comparison of reliability and pre..." refers background in this paper

  • ...IRT, on the other hand, is based on the premise that the probability of a correct response to an item is a function of a person's trait (such as ELD) and item parameters (difficulty, discrimination, and guessing) (Hambleton & Jones, 1993)....

    [...]
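The premise stated in this excerpt is commonly formalized as the three-parameter logistic (3PL) model; the following is the standard statement, with notation assumed rather than quoted from the article:

    P(x_{ij} = 1 \mid \theta_i) = c_j + (1 - c_j) \frac{1}{1 + \exp(-a_j(\theta_i - b_j))}

where b_j is the item's difficulty, a_j its discrimination, and c_j its (pseudo-)guessing parameter.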