
Showing papers in "ETS Research Report Series in 2013"


Journal ArticleDOI
TL;DR: A new corpus of non-native English writing will be useful for the task of native language identification, as well as grammatical error detection and correction, and automatic essay scoring.
Abstract: This report presents work on the development of a new corpus of non-native English writing. It will be useful for the task of native language identification, as well as grammatical error detection and correction, and automatic essay scoring. In this report, the corpus is described in detail.
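
A minimal sketch of the downstream task such a corpus supports: native language identification framed as text classification. The essays, labels, and feature choices below are illustrative stand-ins (the report describes only the corpus), and scikit-learn is assumed to be available.

```python
# Illustrative sketch only: native language identification (NLI) as text
# classification over non-native English essays. Data are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

essays = ["I am agree with this statement ...",
          "In my opinion , is very important ..."]   # one string per essay
native_langs = ["Korean", "Spanish"]                 # L1 label per essay

# Character n-grams are a common NLI feature choice; others would work too.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(essays, native_langs)
print(model.predict(["Essay from an unseen test taker ..."]))
```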

184 citations


Journal ArticleDOI
TL;DR: In this paper, the authors conducted an analysis of the Occupational Information Network (O*NET) database and identified 15 components: problem solving, mechanical skills, service orientation, cultural literacy, business literacy, science literacy, civic literacy, information processing, athleticism, visual acuity, fluid intelligence, communication skills, teamwork, achievement/innovation and attention to detail/near vision.
Abstract: To identify the most important competencies for college graduates to succeed in the 21st century workforce, we conducted an analysis of the Occupational Information Network (O*NET) database. O*NET is a large job-analysis database operated and maintained by the U.S. Department of Labor. We specifically analyzed ratings of the importance of abilities (52 ratings), work styles (16 ratings), skills (35 ratings), and knowledge (33 ratings) to succeed in one's occupation. First, we conducted descriptive analyses. Next, data were split into 2 sets, according to the theoretical structure proposed by the O*NET content model, and principal component analyses (PCAs) were run on each dataset. The PCAs identified 15 components: problem solving, mechanical skills, service orientation, cultural literacy, business literacy, science literacy, civic literacy, information processing, athleticism, visual acuity, fluid intelligence, communication skills, teamwork, achievement/innovation, and attention to detail/near vision. Components were then ranked in importance using the mean component scores over all occupations. A comparison of this ranking with previous 21st century competencies frameworks suggested that 5 competencies stand out as important for most occupations: problem solving (e.g., complex problem solving), fluid intelligence (e.g., category flexibility), teamwork (e.g., cooperation), achievement/innovation (e.g., persistence), and communication skills (e.g., oral expression). Consistent with this conclusion, a correlation of component scores with wages found that 4 of these 5 competencies were strongly related to wages, with the exception being teamwork.
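
The pipeline in this abstract (standardize importance ratings, run PCA, rank components by their mean scores over occupations) is mechanical enough to sketch. The matrix below is random stand-in data, not O*NET, and the component count is an arbitrary assumption:

```python
# Sketch of the analysis pipeline described above, on stand-in data:
# PCA over an occupation-by-descriptor importance matrix, then ranking
# components by mean component score across occupations.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
ratings = rng.random((900, 52))            # occupations x ratings (stand-in)

z = StandardScaler().fit_transform(ratings)
pca = PCA(n_components=8)                  # component count is an assumption
scores = pca.fit_transform(z)              # occupations x components

order = np.argsort(scores.mean(axis=0))[::-1]
print("components ranked by mean importance:", order)
print("variance explained:", pca.explained_variance_ratio_[order])
```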

66 citations



Journal ArticleDOI
TL;DR: The authors developed a working model of persistence informed by a literature review, which resulted in a model centered on three basic categories of variables: those that put students on track towards persistence, those that push them off track, and those that keep them on track.
Abstract: Despite near universal acceptance of the value of higher education for individuals and society, college persistence rates in 4-year and community colleges are low. Only 57% of students who began college at a 4-year institution in 2001 had completed a bachelor's degree by 2007, and only 28% of community college students who started school in 2005 had completed a degree 4 years later (National Center for Education Statistics, 2011). To address this problem, this paper identified 3 goals. The first was to review the extant literature on persistence in higher education. The second was to develop a working model of persistence informed by our literature review. This resulted in a model centered on 3 basic categories of variables: those that put students on track towards persistence, those that push them off track, and those that keep them on track. The final goal was to outline a research agenda to develop student-centered assessments informed by our model, and we conclude with a discussion of this agenda.

55 citations


Journal ArticleDOI
TL;DR: In this article, the authors conduct a comprehensive linguistic analysis of TOEFL iBT responses, interpreting observed linguistic patterns of variation relative to three parameters: mode, task type, and score level of test takers.
Abstract: One of the major innovations of the TOEFL iBT® test is the incorporation of integrated tasks complementing the independent tasks to which examinees respond. In addition, examinees must produce discourse in both modes (speech and writing). The validity argument for the TOEFL iBT includes the claim that examinees vary their discourse in accordance with these considerations as they become more proficient in their academic language skills (the explanation inference). To provide evidence in support of this warrant, we undertake a comprehensive lexico-grammatical description of the discourse produced in response to integrated versus independent tasks, across the spoken and written modes, by test takers from different score levels. Discourse descriptions at several linguistic levels are provided, including vocabulary profiles, collocational patterns, the use of extended lexical bundles, distinctive lexico-grammatical features, and a multidimensional (MD) analysis that describes the overall patterns of linguistic variation. In sum, we undertake a comprehensive linguistic analysis of the discourse of TOEFL iBT responses, interpreting observed linguistic patterns of variation relative to three parameters that are relevant in the TOEFL iBT context: mode, task type, and score level of test takers.
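
Among the analyses listed, extended lexical bundles are the most mechanical to compute: conventionally, recurrent four-word sequences that clear frequency and dispersion thresholds. A rough sketch, with thresholds that are illustrative rather than those of the study:

```python
# Rough sketch of lexical-bundle extraction: 4-word sequences that occur
# often enough, in enough different texts. Thresholds are illustrative.
from collections import Counter

def lexical_bundles(texts, n=4, min_freq=10, min_texts=5):
    freq, spread = Counter(), Counter()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(grams)
        spread.update(set(grams))        # how many texts contain the bundle
    return [g for g, c in freq.items()
            if c >= min_freq and spread[g] >= min_texts]

sample = ["on the other hand it is clear that we must consider"] * 20
print(lexical_bundles(sample)[:5])
```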

53 citations


Journal ArticleDOI
TL;DR: The third installment of the Reading for Understanding (RfU) assessment framework is presented in this paper, where the role of performance moderators in the test design and how scenario-based assessment can be used as a tool for assessment delivery is discussed.
Abstract: This paper represents the third installment of the Reading for Understanding (RfU) assessment framework. This paper builds upon the two prior installments (Sabatini & O'Reilly, 2013; Sabatini, O'Reilly, & Deane, 2013) by discussing the role of performance moderators in the test design and how scenario-based assessment can be used as a tool for assessment delivery. Performance moderators are characteristics of students that impact reading performance but are not considered a part of the reading construct. These include (a) background and prior knowledge, (b) metacognitive and self-regulatory strategies and behavior, (c) reading strategies, and (d) student motivation and engagement. In this paper, we argue there is added value in incorporating performance moderators into a reading test design. We characterize added value with respect to the validity of the claims derived from test scores, the interpretation of the test scores, and the relevance to instruction. As a second aim, we present a case for using scenario-based assessments and show how they can be used to integrate into the test design both the performance moderators and other features that make the assessment more instructionally relevant.

50 citations



Journal ArticleDOI
TL;DR: In this paper, the authors give an overview of the research conducted in several fields of work related to collaboration, propose a framework for the assessment of cognitive skills (such as science or math) through collaborative problem-solving tasks, and propose several statistical approaches to model the data collected from collaborative interactions.
Abstract: Collaboration is generally recognized as a core competency of today's knowledge economy and has taken a central role in recent theoretical and technological developments in education research. Yet, the methodology for assessing the learning benefits of collaboration continues to rely on educational tests designed for isolated individuals. Thus, what counts as evidence of learning does not correspond to current best practices for teaching, and it does not reflect what students are ultimately expected to be able to do with their knowledge. The goals of this paper are to give an overview of the research conducted in several fields of work related to collaboration, propose a framework for the assessment of cognitive skills (such as science or math) through collaborative problem-solving tasks, and propose several statistical approaches to model the data collected from collaborative interactions. This research contributes to the knowledge needed to support a new generation of assessments based on collaboration.

46 citations


Journal ArticleDOI
TL;DR: A general program for item-response analysis using the stabilized Newton-Raphson algorithm is described; the program is written to be compliant with Fortran 2003 standards and is sufficiently general to handle independent variables, multidimensional ability parameters, and matrix sampling.
Abstract: A general program for item-response analysis is described that uses the stabilized Newton-Raphson algorithm. This program is written to be compliant with Fortran 2003 standards and is sufficiently general to handle independent variables, multidimensional ability parameters, and matrix sampling. The ability variables may be either polytomous or multivariate normal. Items may be dichotomous or polytomous.
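
The report names the algorithm without giving details; in programs of this kind, "stabilized" typically means the Newton step is controlled, for example halved until the log-likelihood actually improves. A generic one-parameter illustration of that idea (ability estimation under a 2PL model), not a rendering of the report's Fortran program:

```python
# Generic illustration of a stabilized Newton-Raphson iteration:
# ML ability estimation under a 2PL IRT model, with step-halving so an
# iteration cannot decrease the log-likelihood. Not the report's program.
import numpy as np

def loglik(theta, a, b, x):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

def stabilized_newton(a, b, x, theta=0.0, tol=1e-8, max_iter=50):
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        grad = np.sum(a * (x - p))
        hess = -np.sum(a**2 * p * (1 - p))    # strictly negative here
        step = -grad / hess
        # Stabilization: halve the step until the log-likelihood improves.
        base = loglik(theta, a, b, x)
        while loglik(theta + step, a, b, x) < base and abs(step) > tol:
            step *= 0.5
        theta += step
        if abs(step) < tol:
            break
    return theta

a = np.array([1.0, 1.2, 0.8, 1.5])
b = np.array([-0.5, 0.0, 0.5, 1.0])
x = np.array([1, 1, 0, 1])                    # mixed responses: finite MLE
print(stabilized_newton(a, b, x))
```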

45 citations


Journal Article
TL;DR: In this paper, the authors describe the foundation and rationale for a framework designed to measure reading literacy and build an assessment system that reflects current theoretical conceptions of reading and is developmentally sensitive across a prekindergarten to 12th grade student range.
Abstract: This report describes the foundation and rationale for a framework designed to measure reading literacy. The aim of the effort is to build an assessment system that reflects current theoretical conceptions of reading and is developmentally sensitive across a prekindergarten to 12th grade student range. The assessment framework is intended to document the aims of the assessment program, define the target constructs to be assessed, and describe the assessment designs that are aligned with the aims and constructs. This framework report is preliminary in that we are engaged in an iterative process of writing and revising the framework, based on what we learn from efforts to instantiate the ideas in new assessment designs and the results we garner from piloting novel designs. We also anticipate drafting further sections to address issues such as the scoring models and analytic plans once assessments have been designed and piloted.

35 citations


Journal ArticleDOI
TL;DR: In this article, the authors examine claims that HSGPA is the best single predictor of college grades and that it is more equitable than test scores because of a smaller association with socioeconomic status (SES).
Abstract: Focusing on high school grade-point average (HSGPA) in college admissions may foster ethnic diversity and communicate the importance of high school performance. It has further been claimed that HSGPA is the best single predictor of college grades and that it is more equitable than test scores because of a smaller association with socioeconomic status (SES). Recent findings, however, suggest that HSGPA's seemingly smaller correlation with SES is a methodological artifact. In addition, it tends to produce systematic errors in the prediction of college grades. Although supplementing HSGPA with a high-school resource index can mitigate these errors, determining whether to include such an index in admissions decisions must take into account the institutional mission and the potential diversity impact.
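
The systematic prediction errors mentioned here are typically exposed by fitting one pooled regression of college GPA on HSGPA and examining mean residuals by subgroup. A minimal sketch on simulated stand-in data (the variable names and effect sizes are invented):

```python
# Minimal differential-prediction check on simulated stand-in data: fit a
# pooled regression of first-year GPA on HSGPA, then compare mean
# residuals across subgroups. Opposite-signed means indicate systematic
# over- and under-prediction under the common equation.
import numpy as np

rng = np.random.default_rng(1)
hsgpa = rng.uniform(2.0, 4.0, 1000)
group = rng.integers(0, 2, 1000)              # two hypothetical subgroups
fygpa = 0.8 * hsgpa + 0.2 * group + rng.normal(0, 0.4, 1000)

slope, intercept = np.polyfit(hsgpa, fygpa, 1)
residuals = fygpa - (slope * hsgpa + intercept)
for g in (0, 1):
    print(f"group {g}: mean residual = {residuals[group == g].mean():+.3f}")
```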


Journal ArticleDOI
TL;DR: In this paper, the authors describe the design, development, and technical adequacy of the Reading Inventory and Student Evaluation (RISE) form of the SARA reading components computer-delivered assessment.
Abstract: This paper describes the design, development, and technical adequacy of the Reading Inventory and Student Evaluation (RISE) form of the Study Aid and Reading Assessment (SARA) reading components computer-delivered assessment. The RISE form, designed for middle-school students, began as a joint project among the Strategic Education Research Partnership (SERP), ETS, and a large urban school district where middle-school literacy had been identified as an area needing improvement. Educators were interested in understanding more about the component skills of their middle-school students, particularly their struggling readers, using an efficient and reliable assessment. To date, results from our piloting of the RISE form with middle-school students have established its technical adequacy. In the future, we plan to expand the RISE to create parallel forms for 6th-to-8th graders as well as to explore development of new test forms at other points in the grade continuum.


Journal ArticleDOI
TL;DR: In this chapter, the authors review ETS's development of quantitative procedures for fairness assessment, including differential prediction and differential validity procedures that examine whether test scores predict a criterion, such as performance in college, across different subgroups in a similar manner.
Abstract: ETS has been a leader in the development of quantitative procedures for fairness assessment, and its efforts are reviewed in this chapter. The first section deals with differential prediction and differential validity procedures that examine whether test scores predict a criterion, such as performance in college, across different subgroups in a similar manner. The second section, constituting the bulk of the chapter, focuses on item-level fairness, or differential item functioning. In the third section, research is considered pertaining to whether tests built to the same set of specifications produce scores that are related in the same way across different gender and ethnic groups. Limitations of the approaches are discussed in the final section.
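
As one concrete example of the item-level procedures the chapter reviews, the Mantel-Haenszel approach compares reference- and focal-group odds of success on an item within matched score strata and reports the result on ETS's delta scale. A compact sketch with hypothetical counts:

```python
# Mantel-Haenszel DIF sketch. Each stratum holds hypothetical counts
# (ref_right, ref_wrong, focal_right, focal_wrong) at one matched score.
import math

strata = [(40, 10, 30, 20), (60, 15, 45, 25), (80, 10, 70, 15)]

num = sum(r1 * f0 / (r1 + r0 + f1 + f0) for r1, r0, f1, f0 in strata)
den = sum(r0 * f1 / (r1 + r0 + f1 + f0) for r1, r0, f1, f0 in strata)
alpha_mh = num / den                    # common odds ratio across strata
mh_d_dif = -2.35 * math.log(alpha_mh)   # ETS delta metric; near 0 = no DIF
print(f"alpha_MH = {alpha_mh:.3f}, MH D-DIF = {mh_d_dif:.3f}")
```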



Journal ArticleDOI
TL;DR: In this paper, the research and development of a new type of scoring rubric for the integrated speaking tasks of TOEFL iBT are described: salient features of spoken responses and of raters' judgments were synthesized to develop three analytic rating guides, each using a series of yes/no questions.
Abstract: Research and development of a new type of scoring rubric for the integrated speaking tasks of TOEFL iBT® are described. These analytic rating guides could be helpful if tasks modeled after those in TOEFL iBT were used for formative assessment, a purpose which is different from TOEFL iBT's primary use for admission decisions. Two questions motivated the project: What can be done to make the criteria and standards for good performance clear not only to English language teachers and their students but also to novice raters? How can test-takers be guided in the steps necessary to improve performance? Previous research, quantitative results of linguistic features present in spoken responses, and qualitative themes of raters' judgments were analyzed. Salient features associated with performance were then synthesized to develop three analytic rating guides, each using a series of yes/no questions. The rating guides expanded the current scoring scales for the three dimensions of delivery, language use, and topic development, so that key features of increasingly proficient speaking performance were described in more detail. Suggestions for future validation studies are provided in terms of an iterative process of trials and revisions.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the value of reporting the reading, listening, speaking, and writing section scores for the TOEFL iBT® test, focusing on four related aspects of the psychometric quality of the ToEFL IBT section scores: reliability of the section scores, dimensionality of the test, presence of distinct score profiles, and section scores' generalizability for norm-referenced decisions as well as the dependability of criterion-refereced decisions for international student admission.
Abstract: This study investigates the value of reporting the reading, listening, speaking, and writing section scores for the TOEFL iBT® test, focusing on 4 related aspects of the psychometric quality of the TOEFL iBT section scores: reliability of the section scores, dimensionality of the test, presence of distinct score profiles, and the section scores' generalizability for norm-referenced decisions as well as the dependability of criterion-referenced decisions for international student admission. Four operational TOEFL iBT test forms were analyzed for all examinees as well as for 3 native language (L1) groups (Arabic, Korean, and Spanish). Haberman's (2008) subscore analysis suggested that the speaking section score had added value due to its relative distinctness from the other modalities. Consistent with the subscore analysis results, a series of exploratory factor analyses (EFAs) indicated the possibility of the presence of 2 correlated factors—a reading/listening/writing factor and a speaking factor. In contrast, the CFAs conducted separately for the 3 L1 groups as well as a multiple-group confirmatory factor analyses (CFAs) identified a correlated 4-factor model with reading, listening, speaking, and writing factors as the best representation of the structure of the entire test for all examinees as well as for the 3 L1 groups. Reliability of the observed section scores for norm-referenced score interpretations and the dependability of classification decisions made based on different cut scores were generally satisfactory while they were also found to be relatively low in some circumstances. Based on the mixed results concerning the value-added information the TOEFL iBT section scores provide, recommendations for future research directions and some key issues of consideration for high-stakes decision making based on the section scores were summarized.
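
Haberman's (2008) criterion used above asks whether the observed subscore or the total score better approximates the true subscore, compared through the proportional reduction in mean squared error (PRMSE). A simplified sketch of that comparison, assuming a subscore reliability estimate is already in hand; the closed-form expressions follow standard classical-test-theory identities:

```python
# Simplified sketch of Haberman's (2008) added-value check: a subscore
# adds value if PRMSE(subscore) > PRMSE(total) for predicting the true
# subscore. Inputs: observed scores and a subscore reliability estimate.
import numpy as np

def added_value(sub, total, rel_sub):
    sub, total = np.asarray(sub, float), np.asarray(total, float)
    var_s, var_x = sub.var(ddof=1), total.var(ddof=1)
    cov_sx = np.cov(sub, total, ddof=1)[0, 1]
    var_true = rel_sub * var_s            # true-subscore variance
    err_s = var_s - var_true              # subscore error variance
    # cov(total, true subscore) = cov(total, subscore) - error variance,
    # because the subscore's measurement error is also part of the total.
    prmse_total = (cov_sx - err_s) ** 2 / (var_true * var_x)
    return rel_sub > prmse_total, rel_sub, prmse_total

rng = np.random.default_rng(2)
speaking = rng.normal(20, 4, 500)             # stand-in section scores
total = speaking + rng.normal(60, 8, 500)     # stand-in total scores
print(added_value(speaking, total, rel_sub=0.85))
```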

Journal ArticleDOI
TL;DR: In this chapter, ETS contributions to item response theory are traced from early work by Fred Lord and Bert Green in the 1950s, before the seminal volume by Lord and Novick (Statistical Theories of Mental Test Scores, Addison-Wesley, Reading, 1968), through recent models with complex latent variable structures and multiple dimensions.
Abstract: Few would doubt that researchers at ETS have contributed more to the general topic of item response theory (IRT) than individuals from any other institution. In this chapter, we review most of those contributions, dividing them into sections by decades of publication. The history of IRT begins before the seminal volume by Lord and Novick (Statistical Theories of Mental Test Scores, Addison-Wesley, Reading, 1968), and ETS researchers were central contributors to those developments, beginning with early work by Fred Lord and Bert Green in the 1950s. The chapter traces a wide range of contributions through the decades, ending with recent work that produced models involving complex latent variable structures and multiple dimensions.

Journal ArticleDOI
TL;DR: In this paper, the authors gather and review relevant higher education frameworks, determine their commonalities and create domains, and identify the assessments that Educational Testing Service (ETS) has developed in each of the domains.
Abstract: The public, education, and workforce sectors all have expressed interest regarding the key knowledge, skills, and abilities that enable individuals to be productive members of society. Although past efforts have attempted to create frameworks of student learning outcomes, the results have varied due to different perspectives and goals. Thus, the purpose of this paper was to gather and review relevant higher education frameworks, determine their commonalities and create domains, and identify the assessments that Educational Testing Service (ETS) has developed in each of the domains. After a thorough review of the relevant frameworks, seven key domains were identified: creativity, critical thinking, teamwork, communication, digital and information literacy, citizenship, and life skills. Also discussed were the issues of education versus work contextualization and the assumption of foundational quantitative reasoning and literacy skills informing these seven domains.

Journal ArticleDOI
TL;DR: The Cognitively Based Assessment of, for, and as Learning (CBAL) science competency model and its related learning progressions were developed by applying the CBAL approach (Bennett & Gitomer, 2009) to the domain of middle school science.
Abstract: The purpose of this report is to describe a science competency model and 3 related learning progressions, which were developed by applying the CBAL™ approach (Bennett & Gitomer, 2009) to the domain of middle school science. The Cognitively Based Assessment of, for, and as Learning (CBAL) science competency model and its related learning progressions were developed by reviewing existing literature on learning sciences and science education, which have placed increasing emphasis on learners' knowledge and ability to apply scientific knowledge to conduct evidence-based reasoning. In this report, we present the 5 competencies in our science competency model that reflect current efforts in the Next Generation Science Standards and the recent reform-based curriculum to promote integrated and generative understanding. In addition, we report 3 hypothesized learning progressions related to our competency model to define the increasing sophistication of both content understanding and the capacity to carry out scientific inquiry. Then we discuss features of assessment prototypes developed under the guidance of the competency model and the learning progressions, by illustrating parts of 1 sample formative assessment task prototype.

Journal ArticleDOI
TL;DR: In this study, simulated data were used to investigate the properties of a newly proposed method (Yao's rater model) for modeling rater severity and its distribution under different conditions.
Abstract: The current study used simulated data to investigate the properties of a newly proposed method (Yao's rater model) for modeling rater severity and its distribution under different conditions. Our study examined the effects of rater severity, distributions of rater severity, the difference between item response theory (IRT) models with rater effect and without rater effect, and the difference between the precision of the ability estimates for tests composed of only constructed-response (CR) items and for tests composed of multiple-choice (MC) and CR items combined. Our results indicate that rater severity and its distribution can increase the bias of examinees' ability estimates and lower test reliability. Moreover, using an IRT model with rater effects can substantially increase the precision of the examinees' ability estimates, especially when the test was composed of only CR items. We also compared Yao's rater model with Muraki's (1993) rater effect model in terms of ability estimation accuracy and rater parameter recovery. The estimation results from Yao's rater model using Markov chain Monte Carlo (MCMC) were better than those from Muraki's rater effect model using marginal maximum likelihood.
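
The abstract does not give the functional form of Yao's model. As a generic illustration of how a rater-severity term enters an IRT model, the facets-style sketch below assumes each binary rating depends on examinee ability, item difficulty, and the assigned rater's severity:

```python
# Generic facets-style illustration (not Yao's actual model):
#     P(X = 1) = logistic(theta - b - phi)
# where phi is rater severity. A severe rater (phi > 0) depresses scores,
# so ignoring phi biases ability estimates, as the study's results suggest.
import numpy as np

rng = np.random.default_rng(3)
n_examinees, n_items = 200, 6
severities = np.array([-0.5, 0.0, 0.8])      # hypothetical rater severities

theta = rng.normal(0, 1, n_examinees)
b = np.linspace(-1, 1, n_items)
rater = rng.integers(0, len(severities), (n_examinees, n_items))

logit = theta[:, None] - b[None, :] - severities[rater]
x = rng.random((n_examinees, n_items)) < 1 / (1 + np.exp(-logit))

# Scores assigned by the severe rater come out visibly lower on average.
for r, phi in enumerate(severities):
    print(f"rater {r} (phi={phi:+.1f}): mean score = {x[rater == r].mean():.3f}")
```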

Journal ArticleDOI
TL;DR: The authors investigated the 10-year trends of community college students' performance in reading, writing, mathematics, and critical thinking, as assessed by the ETS® Proficiency Profile (EPP), an assessment of college-level learning outcomes.
Abstract: Community colleges currently enroll about 44% of the undergraduate students in the United States and are rapidly expanding. It is of critical importance to obtain direct evidence of student learning to see if students receive adequate training at community colleges. This study investigated the 10-year trends of community college students' (n = 46,403) performance in reading, writing, mathematics, and critical thinking, as assessed by the ETS® Proficiency Profile (EPP), an assessment of college-level learning outcomes. Results showed that community college students caught up with and significantly outperformed students from liberal arts colleges by the end of the 10-year period and made significant improvement in critical-thinking skills. An increasing gender gap was observed in mathematics at community colleges. Prevalent ethnic minority and English as a second language (ESL) gaps were noted but gaps between ESL and non-ESL students and between Hispanic and White students were decreasing. Additionally, Asian students at community colleges showed an overall decline in performance. Findings from this study provide significant implications for community college leaders, researchers, and policymakers.

Journal ArticleDOI
TL;DR: The SuccessNavigator assessment is an online, 30-minute self-assessment of psychosocial and study skills for students entering postsecondary education, designed to predict a range of early academic outcomes.
Abstract: The SuccessNavigator™ assessment is an online, 30-minute self-assessment of psychosocial and study skills designed for students entering postsecondary education. In addition to providing feedback in areas such as classroom and study behaviors, commitment to educational goals, management of academic stress, and connection to social resources, it is also designed to predict a range of early academic outcomes. Because it indicates students' likely success, advisors, faculty, and staff can target their interactions with students to increase their likelihood of success. This report outlines evidence of reliability, validity, and fairness to demonstrate the appropriateness of SuccessNavigator for these purposes.

Journal Article
TL;DR: In this article, the authors report the results of two quasiexperimental studies conducted to examine the efficacy of a new time management intervention designed for high school students. There was no difference between the treatment and control groups in improvement in self-reported time management skills as a result of the intervention; however, the treatment group reported significantly greater improvement than the control group for secondary outcomes such as stress, anxiety, depression, and knowledge of time management strategies.
Abstract: The current paper reports the results of 2 quasiexperimental studies conducted to examine the efficacy of a new time management intervention designed for high school students. In both studies, there was no difference between the treatment and control groups in improvement in self-reported time management skills as a result of the intervention. However, the treatment group reported significantly greater improvement than the control group for secondary outcomes such as stress (Studies 1 and 2), anxiety (Studies 1 and 2), depression (Study 1), and knowledge of time management strategies (Study 1). Additionally, advisor ratings of student time management skills were higher for the treatment than for the control group in Study 2. Implications and suggestions for improving the intervention are discussed.

Journal ArticleDOI
TL;DR: The authors conducted a cognitive interview study to investigate the validity of content knowledge for teaching (CKT) assessments and found evidence that the reasoning used by participants represents the underlying knowledge and skill domain the assessments are intended to measure.
Abstract: This report provides a description of a cognitive interview study investigating validity of assessments designed to measure content knowledge for teaching (CKT). The report is intended both to provide information on the validity of the CKT measures and to provide guidance to researchers interested in replicating the design. The study takes an argument-based approach to investigating validity by first articulating interpretive arguments that are central to the CKT measurement theory and then using the cognitive interview data to evaluate these arguments (Kane, 2006). The study is based on 30 interviews of elementary mathematics teachers and 30 interviews of elementary English language arts teachers. Teachers were selected using previous CKT assessment scores to represent high- and low-scoring groups for each subject. The cognitive interviews were conducted separately for each subject, and responses were coded and then analyzed to investigate the scoring and extrapolation inferences for the validity argument. Findings strongly support the scoring inference, providing evidence that the items are keyed correctly. Results also indicate that the participants reasoned about the items in ways that conformed with the reasoning outlined in the task design rationales (TDRs) for each item. These TDRs represent what reasoning should look like for each of these items for a respondent drawing on the desired CKT knowledge. As such, conformity with the TDRs supports the extrapolation inference, providing evidence that the reasoning used by the participants represents the underlying knowledge and skill domain we intend to measure through CKT assessments. The study design, instruments, methods, and results are described in detail, with discussion included to support researchers interested in replicating or capitalizing on the study design.