
Showing papers in "Language Testing in 2018"


Journal ArticleDOI
TL;DR: In this article, the authors argue for integrating the construct of interactional competence (IC) into the assessment of speaking, noting that a psycholinguistically based speaking construct has so far predominated.
Abstract: In the assessment of speaking, a psycholinguistically based speaking construct has predominated. In this paper, we argue for the integration of the construct of interactional competence (IC) in spe...

81 citations


Journal ArticleDOI
TL;DR: This study explored the constructs that underpin three different measures of vocabulary knowledge and investigated the degree to which these measures correlate with, and are able to predict, that knowledge.
Abstract: This study explores the constructs that underpin three different measures of vocabulary knowledge and investigates the degree to which these three measures correlate with, and are able to predict, ...

81 citations


Journal ArticleDOI
TL;DR: This article identified and explored the dominant methods for evaluating rating quality within the context of research on large-scale rater-mediated language assessments and highlighted the reliance upon aggregate-level information that is not specific to individual raters or sp...
Abstract: The use of assessments that require rater judgment (i.e., rater-mediated assessments) has become increasingly popular in high-stakes language assessments worldwide. Using a systematic literature review, the purpose of this study is to identify and explore the dominant methods for evaluating rating quality within the context of research on large-scale rater-mediated language assessments. Results from the review of 259 methodological and applied studies reveal an emphasis on inter-rater reliability as evidence of rating quality that persists across methodological and applied studies, studies primarily focused on rating quality and studies not primarily focused on rating quality, and across multiple language constructs. Additional findings suggest discrepancies in rating designs used in empirical research and practical concerns in performance assessment systems. Taken together, the findings from this study highlight the reliance upon aggregate-level information that is not specific to individual raters or sp...

58 citations
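The review's emphasis on inter-rater reliability can be illustrated with the kind of aggregate agreement statistics it discusses. The sketch below computes exact agreement and Cohen's kappa for two hypothetical raters using scikit-learn; the data, score scale, and choice of statistic are illustrative assumptions, not taken from the study.

```python
# Illustrative only: two hypothetical raters scoring the same 10 performances on a 0-4 scale.
# Cohen's kappa (and its quadratic-weighted variant) are common aggregate indices of
# inter-rater reliability; the review argues such indices say little about individual raters.
from sklearn.metrics import cohen_kappa_score

rater_a = [3, 2, 4, 1, 3, 2, 0, 4, 3, 2]   # hypothetical scores from rater A
rater_b = [3, 3, 4, 1, 2, 2, 1, 4, 3, 2]   # hypothetical scores from rater B

exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa_score(rater_a, rater_b)                           # unweighted kappa
qw_kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")   # penalises larger score gaps

print(f"exact agreement          = {exact_agreement:.2f}")
print(f"Cohen's kappa            = {kappa:.2f}")
print(f"quadratic-weighted kappa = {qw_kappa:.2f}")
```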


Journal ArticleDOI
TL;DR: As discussed in this paper, the ability to interact with others has gained recognition as part of the L2 speaking construct, both in the assessment literature and in high- and low-stakes speaking assessments.
Abstract: The ability to interact with others has gained recognition as part of the L2 speaking construct in the assessment literature and in high- and low-stakes speaking assessments. This paper first prese...

51 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an analysis of how the concerns of language testers can be conceptualized in the terms used to construct a validity argument, showing that the relevance of research about the rating of test performances extends beyond one or two inferences about rater reliabilit...
Abstract: Argument-based validation requires test developers and researchers to specify what is entailed in test interpretation and use. Doing so has been shown to yield advantages (Chapelle, Enright, & Jamieson, 2010), but it also requires an analysis of how the concerns of language testers can be conceptualized in the terms used to construct a validity argument. This article presents one such analysis by examining how issues associated with the rating of test takers’ linguistic performance can be included in a validity argument. Through a manual search of published language testing research, we gathered examples of research studies investigating the quality of rating processes and products. We then analyzed them in terms of how the research could be framed within a validity argument. Drawing on Kane’s (2001, 2006, 2013) conceptualization of inferences, warrants, and assumptions, we show that the relevance of research about the rating of test performances extends beyond one or two inferences about rater reliabilit...

50 citations


Journal ArticleDOI
TL;DR: This article examined the predictive validity of TOEFL iBT with respect to academic achievement as measured by the first-year grade point average (GPA) of Chinese students at Purdue University, a large, public, Research I institution in Indiana, USA.
Abstract: This study examines the predictive validity of the TOEFL iBT with respect to academic achievement as measured by the first-year grade point average (GPA) of Chinese students at Purdue University, a large, public, Research I institution in Indiana, USA. Correlations between GPA and TOEFL iBT total and subsection scores were examined for 1,990 mainland Chinese students enrolled across three academic years (n = 740 in 2011, 554 in 2012, and 696 in 2013). Subsequently, cluster analyses on the three cohorts’ TOEFL subsection scores were conducted to determine whether different score profiles might help explain the correlational patterns found between TOEFL subscale scores and GPA across the three student cohorts. For the 2011 and 2012 cohorts, speaking and writing subscale scores were positively correlated with GPA; however, negative correlations were observed for listening and reading. In contrast, for the 2013 cohort, the writing, reading, and total subscale scores were positively correlated with GPA, and the negative co...

42 citations
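As a rough illustration of the two analysis steps the abstract describes, the sketch below correlates synthetic TOEFL iBT subsection scores with GPA and then clusters the score profiles with k-means; the data, variable names, and number of clusters are assumptions for illustration and do not reflect the study's actual procedures or results.

```python
# Sketch of the two analysis steps described in the abstract, on synthetic data:
# (1) correlate each TOEFL iBT subsection score with first-year GPA;
# (2) cluster test takers by their subsection score profiles.
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 200  # hypothetical cohort size
scores = {                                   # hypothetical subsection scores (0-30 scale)
    "reading":   rng.integers(15, 31, n),
    "listening": rng.integers(15, 31, n),
    "speaking":  rng.integers(15, 31, n),
    "writing":   rng.integers(15, 31, n),
}
gpa = rng.uniform(2.0, 4.0, n)               # hypothetical first-year GPA

# Step 1: subsection-GPA correlations
for section, vals in scores.items():
    r, p = pearsonr(vals, gpa)
    print(f"{section:<9} r = {r:+.2f} (p = {p:.3f})")

# Step 2: k-means on the four-dimensional score profiles (k chosen for illustration)
profile_matrix = np.column_stack(list(scores.values()))
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profile_matrix)
print("cluster sizes:", np.bincount(clusters))
```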


Journal ArticleDOI
TL;DR: This paper provides an overview of the automated speech scoring system SpeechRaterSM and shows how charts and evaluation statistics can be used to monitor and evaluate automated and human rater scores of spoken constructed responses.
Abstract: As automated scoring systems for spoken responses are increasingly used in language assessments, testing organizations need to analyze their performance, as compared to human raters, across several dimensions, for example, on individual items or based on subgroups of test takers. In addition, there is a need in testing organizations to establish rigorous procedures for monitoring the performance of both human and automated scoring processes during operational administrations. This paper provides an overview of the automated speech scoring system SpeechRaterSM and how to use charts and evaluation statistics to monitor and evaluate automated scores and human rater scores of spoken constructed responses.

41 citations
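A hedged sketch of the sort of evaluation statistics such monitoring might rely on is shown below; the statistics chosen (Pearson correlation, quadratic-weighted kappa, standardized mean difference) and the flagging thresholds are illustrative assumptions, not documented SpeechRaterSM procedures.

```python
# Hypothetical monitoring check: compare automated scores against human scores
# for one administration using common agreement statistics.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

human = np.array([3, 2, 4, 3, 1, 2, 4, 3, 2, 3])      # hypothetical human scores (0-4)
machine = np.array([3, 2, 3, 3, 2, 2, 4, 4, 2, 3])    # hypothetical automated scores (0-4)

r, _ = pearsonr(human, machine)
qwk = cohen_kappa_score(human, machine, weights="quadratic")
std_mean_diff = (machine.mean() - human.mean()) / human.std(ddof=1)

print(f"Pearson r                = {r:.2f}")
print(f"quadratic-weighted kappa = {qwk:.2f}")
print(f"standardized mean diff   = {std_mean_diff:+.2f}")

# A monitoring routine might flag the administration if agreement drops below
# pre-set thresholds -- the cut-offs here are purely illustrative.
if r < 0.7 or qwk < 0.7 or abs(std_mean_diff) > 0.15:
    print("flag: human-machine agreement below illustrative thresholds")
```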


Journal ArticleDOI
TL;DR: This paper reports on research conducted during a series of workshops on language assessment delivered to Haitian teachers in the spring of 2013.
Abstract: Research was conducted during the delivery of a series of workshops on language assessment with Haitian teachers in the spring of 2013. The final products of these workshops were several revised na...

40 citations


Journal ArticleDOI
TL;DR: The authors investigated test-takers' processing while completing banked gap-fill tasks designed to assess reading proficiency, in order to test theoretically based expectations about how test-takers' cognitive processes vary across levels of performance.
Abstract: This study investigates test-takers’ processing while completing banked gap-fill tasks, designed to test reading proficiency, in order to test theoretically based expectations about the variation in cognitive processes of test-takers across levels of performance. Twenty-eight test-takers’ eye traces on 24 banked gap-fill items (on six tasks) were analysed according to seven on-line eye-tracking measures representing overall, text and task processing. Variation in processing was related to test-takers’ level of performance on the tasks overall. In particular, as hypothesised, lower-scoring students exerted more cognitive effort on local reading and lower-level cognitive processing in contrast to test-takers who attained higher scores. The findings of different cognitive processes associated with variation in scores illuminate the construct measured by banked gap-fill items, and therefore have implications for test design and the validity of score interpretations.

37 citations
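The kind of analysis described, relating eye-tracking measures to task performance, might be sketched as follows; the measure names, data layout, and correlation choice are assumptions for illustration and do not reproduce the study's seven measures or its statistical treatment.

```python
# Hypothetical sketch: aggregate eye-tracking measures per test taker and
# correlate them with task scores, in the spirit of the analysis described above.
import pandas as pd
from scipy.stats import spearmanr

# One row per (test taker, item): toy data standing in for exported eye-tracker output.
records = pd.DataFrame({
    "participant": ["p1", "p1", "p2", "p2", "p3", "p3"],
    "item":        [1, 2, 1, 2, 1, 2],
    "fixation_ms": [5200, 4800, 7300, 6900, 6100, 5800],  # total fixation duration
    "visits":      [6, 5, 9, 8, 7, 7],                    # visit counts on the gap
    "item_score":  [1, 1, 0, 0, 1, 0],                    # 1 = correct
})

per_person = records.groupby("participant").agg(
    mean_fixation_ms=("fixation_ms", "mean"),
    mean_visits=("visits", "mean"),
    total_score=("item_score", "sum"),
)

# Rank correlation between processing effort and overall task score.
for measure in ["mean_fixation_ms", "mean_visits"]:
    rho, p = spearmanr(per_person[measure], per_person["total_score"])
    print(f"{measure}: Spearman rho = {rho:+.2f} (p = {p:.2f})")
```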


Journal ArticleDOI
TL;DR: The authors developed an L2 English comprehensibility scale targeting the degree of perceived listener effort required to understand L2 speech, intended as a practical tool to guide teachers toward the linguistic factors that matter most for being understood and to raise learners' awareness of their abilities.
Abstract: There is growing research on the linguistic features that most contribute to making second language (L2) speech easy or difficult to understand. Comprehensibility, which is usually captured through listener judgments, is increasingly viewed as integral to the L2 speaking construct. However, there are shortcomings in how this construct is operationalized in L2 speaking proficiency scales. Moreover, teachers and learners have little practical means of benefiting from research pinpointing the properties of learners’ oral performance that optimize or hinder their ability to be understood. There is thus the need for a tool to guide teachers on what to focus on in instruction in order to target more effectively the linguistic factors that matter most for being understood and to raise learners’ awareness about their abilities. To address this gap, this article reports on the development of an L2 English comprehensibility scale targeting the degree of perceived listener effort required for understanding L2 speech...

37 citations


Journal ArticleDOI
TL;DR: Noting that raters' knowledge and experience may influence their ratings, both in terms of leniency and in their focus on different aspects of speech, the authors investigated whether professional and non-professional raters with broad exposure to L2 speech demonstrate similar responsiveness to fluency and linguistic accuracy in an occupational context.
Abstract: It is general practice to use rater judgments in speaking proficiency testing. However, it has been shown that raters’ knowledge and experience may influence their ratings, both in terms of leniency and varied focus on different aspects of speech. The purpose of this study is to identify raters’ relative responsiveness to fluency and linguistic accuracy in an occupational context, and to investigate whether professional and non-professional raters with a broad exposure to L2 speech demonstrate similar responsiveness to these two aspects. To this end, an experimental approach was applied. Fluency and accuracy were separated and systematically manipulated. As it is known that foreign accentedness of speech influences raters’ judgments, this factor was accounted for. Seventeen responses to a Dutch L2 exam in a vocational context were converted into four different versions manipulated for morpho-syntactical accuracy and/or fluency, and read by a Dutch L2 actor, resulting in 68 stimuli. Fifty-five professional ...

Journal ArticleDOI
TL;DR: The papers in this special issue provide support for continued scrutiny of interactional competence (IC) as an important component of the speaking construct; the authors discuss the challenges associated with including IC in the speaking construct and the implications of the studies in this issue for the relationship between IC and proficiency.
Abstract: The papers in this special issue provide support for continued scrutiny of interactional competence (IC) as an important component of the speaking construct. The contributions underscore the complex nature of IC and remind us of the multiple factors that affect any construct definition. At the same time, each study offers insights into those factors through their explorations of IC. In this final paper, we first briefly review key findings from the papers that confirm what is already known about IC and that provide new information to our understanding of the construct of IC. After summarizing points of convergence and of divergence, we turn to a discussion of areas that require additional targeted attention and offer four generalizations as starting points for research. In the final section, we take a critical look at the challenges associated with including IC in the speaking construct and the implications of the studies in this special issue for the relationship between IC and proficiency.

Journal ArticleDOI
TL;DR: The authors examined the influence of reading and listening input on test-takers' oral responses in integrated speaking test tasks, in which such input serves as the basis for the responses test-takers formulate.
Abstract: Integrated speaking test tasks (integrated tasks) provide reading and/or listening input to serve as the basis for test-takers to formulate their oral responses. This study examined the influence o...

Journal ArticleDOI
TL;DR: This paper challenges the assumption underlying university entrance language tests that, even if language proficiency does not determine academic success, a certain proficiency level is still required.
Abstract: University entrance language tests are often administered under the assumption that even if language proficiency does not determine academic success, a certain proficiency level is still required. ...

Journal ArticleDOI
TL;DR: In this article, the authors identify what aviation experts consider to be the key features of effective communication by examining in detail their commentary on a 17-minute segment of recorded radiotelephon...
Abstract: This paper aims to identify what aviation experts consider to be the key features of effective communication by examining in detail their commentary on a 17-minute segment of recorded radiotelephon...

Journal ArticleDOI
TL;DR: The authors investigated the children's attentional foci on different test components (e.g., prompts, pictures, and a countdown timer) by means of their eye movements and found that NNS tended to fixate longer on and looked more frequently at the countdown timer than their NS peers, who were more lik...
Abstract: We investigated how young language learners process their responses on and perceive a computer-mediated, timed speaking test. Twenty 8-, 9-, and 10-year-old non-native English-speaking children (NNSs) and eight same-aged, native English-speaking children (NSs) completed seven computerized sample TOEFL® Primary™ speaking test tasks. We investigated the children’s attentional foci on different test components (e.g., prompts, pictures, and a countdown timer) by means of their eye movements. We associated the children’s eye-movement indices (visit counts and fixation durations) with spoken performance. The children provided qualitative data (interviews; picture-drawings) on their test experiences as well. Results indicated a clear contrast between NNSs and NSs in terms of speech production (large score differences) as expected. More interestingly, the groups’ eye-movement patterns differed. NNSs tended to fixate longer on and looked more frequently at the countdown timer than their NS peers, who were more lik...

Journal ArticleDOI
TL;DR: This paper examines the issue of standardization in L2 oral testing: whereas external examiners are frequently used globally, some countries opt for test-takers' own teachers as examiners.
Abstract: The present paper looks at the issue of standardization in L2 oral testing. Whereas external examiners are frequently used globally, some countries opt for test-takers’ own teachers as examiners in ...

Journal ArticleDOI
TL;DR: The authors used a test format that standardizes the interlocutor's linguistic and interactional contributions to the exchange in order to assess the interactional performance of pre-vocational learners, and report on the extent to which these scripted tasks can be used to assess L2 speakers' interactional performance in a reliable and valid manner.
Abstract: This article explores ways to assess interactional performance, and reports on the use of a test format that standardizes the interlocutor’s linguistic and interactional contributions to the exchange. It describes the construction and administration of six scripted speech tasks (instruction, advice, and sales tasks) with pre-vocational learners (n = 34), and reports on the extent to which these tasks can be used to assess L2 speakers’ interactional performance in a reliable and valid manner. The high levels of agreement found between three independent raters on both holistic and analytical measurements of interactional performance indicate that this construct can be measured reliably with these tasks. Means and standard deviations demonstrate that tasks differentiate between speakers’ interactional performance. Holistic ratings of linguistic accuracy and interactional ability correlate highly between tasks that focus on different language functions, and are situated in different interactional domains. Furthermore, positive correlations are found between both holistic and analytic ratings of oral performance and vocabulary size. Positive within-task correlations between analytical ratings of specific interactional strategies and holistic ratings of overall interactional ability show that analytic ratings of meaning negotiation and correcting misinterpretation provide additional information about speakers’ interactional ability that is not captured by holistic assessment alone. It is concluded that these tasks are a useful diagnostic tool for practitioners to support their learners’ interactional abilities at a sub-skill level.
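A minimal sketch of two of the correlational analyses mentioned above (pairwise agreement among three raters, and the relation between holistic and analytic ratings) is given below; the rater labels, scores, and scale are hypothetical and not drawn from the study's data.

```python
# Hypothetical sketch of two of the analyses mentioned above:
# pairwise correlations among three raters' holistic scores, and the
# correlation between holistic and analytic ratings of the same speakers.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

holistic = {                                      # hypothetical holistic scores per rater
    "rater1": [4, 3, 5, 2, 4, 3, 5, 2],
    "rater2": [4, 3, 4, 2, 4, 3, 5, 3],
    "rater3": [5, 3, 5, 2, 3, 3, 5, 2],
}
analytic_interaction = [4, 3, 5, 2, 4, 2, 5, 2]   # hypothetical analytic IC ratings

for a, b in combinations(holistic, 2):
    r, _ = pearsonr(holistic[a], holistic[b])
    print(f"{a} vs {b}: r = {r:.2f}")

mean_holistic = np.mean(list(holistic.values()), axis=0)
r, _ = pearsonr(mean_holistic, analytic_interaction)
print(f"mean holistic vs analytic IC rating: r = {r:.2f}")
```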

Journal ArticleDOI
TL;DR: This paper examined the reliability of the reading, listening, speaking, and writing section scores for the TOEFL iBT® test and their interrelationship in order to collect empirical evidence.
Abstract: The present study examined the reliability of the reading, listening, speaking, and writing section scores for the TOEFL iBT® test and their interrelationship in order to collect empirical evidence...

Journal ArticleDOI
TL;DR: The authors investigated whether test takers' breadth and depth of vocabulary knowledge contribute to their efficient use of lexical bonds while restoring damaged texts.
Abstract: The present study intended to investigate whether test takers’ breadth and depth of vocabulary knowledge can contribute to their efficient use of lexical bonds while restoring damaged texts in redu...

Journal ArticleDOI
TL;DR: As discussed in this paper, interactional competence has been variously defined in terms of turn-taking ability, paralinguistic features of communication such as eye contact, gesture, and gesticulation, and listener responses.
Abstract: Interactional competence has been variously defined as turn-taking ability, paralinguistic features of communication such as eye contact, gesture, and gesticulation, and listener responses. In exis...

Journal ArticleDOI
TL;DR: K-12 English language proficiency tests that assess multiple content domains (e.g., listening, speaking, reading, writing) often have subsections based on these content domains; scores assigned to...
Abstract: K–12 English language proficiency tests that assess multiple content domains (e.g., listening, speaking, reading, writing) often have subsections based on these content domains; scores assigned to ...

Journal ArticleDOI
TL;DR: As discussed by the authors, the Test of German as a Foreign Language (TestDaF) is a standardized test of German language proficiency that has been used for admission to German universities since its launch in 2001.
Abstract: The German university setting has experienced a dramatic change over the past several decades with respect to students entering from abroad. In 2015, international students comprised 11.9% of all students enrolled in public universities, and recent global developments (most notably the massive migration of refugees into Germany) have resulted in rapidly evolving demands on institutions of higher education (g.a.s.t., 2016). Greater numbers of students from different countries of origin, with diverse educational backgrounds and distinct learning needs, are seeking admittance to the low-cost and highly regarded German university system. A key concern for this heterogeneous group of students is the extent to which they are prepared to participate in courses of study and university life, the primary language for which is German. It is within this milieu that the Test of German as a Foreign Language (TestDaF) plays a critical role as a standardized test of German language proficiency. Developed and administered by the Society for Academic Study Preparation and Test Development (g.a.s.t.), TestDaF was launched in 2001 and has experienced persistent annual growth, with more than 44,000 test takers in 2016 (a 16% increase over the previous year; g.a.s.t., 2017). Of note, and in keeping with the motto of “Study successfully in German”, TestDaF is one of a suite of products and services offered by g.a.s.t. and intended to facilitate access to German university studies, including the following: an Internet-based platform for individualized language learning (Deutsch-Uni Online or DUO); an online assessment for placing students into foreign language courses (onSET); and a university aptitude assessment, the Test for Academic Studies (TestAS). On its website, g.a.s.t. provides ample information regarding all of these products for different audiences, including test takers, test centers, universities, and teachers. This information is also made available in 20 languages, and potential test takers can even complete a brief automated C-test and receive feedback regarding their chances of passing TestDaF successfully. While the current review focuses on TestDaF per se, the presentation of this test as one part of an overall effort to support and facilitate international student access to German university studies reflects an important public service dimension underlying the assessment.

Journal ArticleDOI
TL;DR: The claims articulated and the evidence collected throughout development and pilot testing enable a wide-ranging, comparative evaluation of five- and ten-item TOEFL Primary Reading screener tests, systematically incorporating the concepts of measurement quality, impact, and practicality.
Abstract: In this study, we define the term screener test, elaborate key considerations in test design, and describe how to incorporate the concepts of practicality and argument-based validation to drive an ...

Journal ArticleDOI
TL;DR: As described in this review, the DELF (Diplôme d'études en langue française) and DALF (Diplôme approfondi de langue française) are official qualifications used to certify the French competence of non-French citizens or of French citizens from non-francophone countries who have not completed a French secondary or higher education diploma.
Abstract: Recent estimates indicate that French is spoken as a first or additional language by over 220 million people. French is an official language in 29 countries and in many organizations such as the United Nations and the Red Cross. French is also, after English, the most widely taught language in educational systems around the world, with an estimated 120 million students and 500,000 teachers. It is hardly surprising then that there is a strong international demand for official certification of French competence and that a range of tests are on offer to meet this goal. Among the recognized tests available for this purpose are the DELF (Diplôme d’études en langue française) and DALF (Diplôme approfondi de langue française). These are official qualifications awarded by the French Ministry of Education to certify the French competence of non-French citizens or of French citizens from non-francophone countries who have not completed a French secondary or higher education diploma. There are six independent diplomas: three for children or adolescents (DELF Prim, DELF Junior and DELF Scolaire) and three for adults (DELF tout public, a general proficiency qualification for those over 16 years of age, DELF Pro, a work-related test for those seeking initial employment opportunities or promotion, and DALF for higher level candidates). Each test is oriented to the CEFR scale with DELF Prim pitched at the pre-A1 to A2 levels for immigrants with limited literacy backgrounds, the other DELF tests spanning the A1 to B2 levels and the DALF assessing proficiency at the more advanced C1 and C2 levels. Each test battery covers the four skill components of Listening, Speaking, Reading and Writing.

Journal ArticleDOI
TL;DR: The papers in this special issue target the construct of interactional competence and examine data obtained from a variety of contexts, languages, and proficiency levels, including classrooms, professional settings, group oral exams, and telephone-delivered exams.
Abstract: Perhaps more than any other skill area in second language testing, the assessment of speaking has seen substantial changes. Speaking test formats have evolved from dictations and face-to-face scripted interviews (e.g., UCLES Certificate of Proficiency in English, United States Foreign Service Institute) of the early twentieth century to the telephone-delivered and computer-based ACTFL Oral Proficiency Interviews (OPIs), as well as the semi-direct speaking tests such as TOEFL iBT® and the fully automatic Pearson PTE ProfessionalTM that are available today (for historical summaries, see Brooks, 2017; Fulcher, 2003; McNamara, 1996). While technological innovations continue to influence speaking test formats, the adoption of the model of communicative competence (Canale & Swain, 1980) in language teaching and learning has resulted in more paired and group work formats in language assessment (Taylor & Wigglesworth, 2009). Consequently, increased attention has turned to investigating interaction and its role in the speaking construct (cf. Berry, 2007; Brooks, 2009; Davis, 2009; Ducasse & Brown, 2009; Galaczi, 2008; May, 2009, 2011; Nakatsuhara, 2006; Nakatsuhara, 2011). The papers in this special issue target the construct of interactional competence (IC) and examine data obtained from a variety of contexts, languages, and proficiency levels. The researchers explore IC in the classroom, in professional settings, in group oral exams, and in telephone-delivered exams. Speakers represent such structurally diverse languages as Arabic, Chinese, English, Japanese, Korean, and Russian. The studies in this special issue not only contribute to earlier work in IC but also advance the field by opening Pandora’s Box (McNamara, 1996) a little more. Issues addressed include: The relationship between the construct of interactional competence and that of speaking proficiency; the relationship between IC and proficiency level; listener responses as distinct