
Showing papers in "Language Testing in 2007"


Journal ArticleDOI
TL;DR: This paper explains why D is affected by text length, and demonstrates with an extensive empirical analysis that the effects of text length are significant over certain ranges, which are identified.
Abstract: A reliable index of lexical diversity (LD) has remained stubbornly elusive for over 60 years. Meanwhile, researchers in fields as varied as stylistics, neuropathology, language acquisition, and even forensics continue to use flawed LD indices — often ignorant that their results are questionable and in some cases potentially dangerous. Recently, an LD measurement instrument known as vocd has become the virtual tool of the LD trade. In this paper, we report both theoretical and empirical evidence that calls into question the rationale for vocd and also indicates that its reliability is not optimal. Although our evidence shows that vocd's output (D) is a relatively robust indicator of the aggregate probabilities of word occurrences in a text, we show that these probabilities — and thus also D — are affected by text length. Malvern, Richards, Chipere and Duran (2004) acknowledge that D (as calculated by vocd's default method) can be affected by text length, but claim that the effects are not significant for t...

238 citations
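
The abstract turns on two technical points: type-token ratio (TTR) falls as texts get longer, and vocd estimates D by fitting a curve to mean TTRs of random subsamples. The Python sketch below illustrates that general sampling-and-curve-fitting idea only; the subsample sizes, trial count, search grid and the curve TTR(N) = (D/N)(sqrt(1 + 2N/D) - 1) follow the approach attributed to Malvern et al. (2004), but this is not the vocd implementation and its defaults may differ.

    import math
    import random

    def mean_ttr(tokens, n, trials=100):
        # average type-token ratio over `trials` random n-token subsamples
        return sum(len(set(random.sample(tokens, n))) / n for _ in range(trials)) / trials

    def estimate_d(tokens, sizes=range(35, 51)):
        # fit D so that (D/N) * (sqrt(1 + 2N/D) - 1) tracks the observed mean TTRs
        observed = [(n, mean_ttr(tokens, n)) for n in sizes]
        best_d, best_err = None, float("inf")
        for d in (x / 10 for x in range(10, 2001)):        # candidate D from 1.0 to 200.0
            err = sum((ttr - (d / n) * (math.sqrt(1 + 2 * n / d) - 1)) ** 2
                      for n, ttr in observed)
            if err < best_err:
                best_d, best_err = d, err
        return best_d

Because longer texts inevitably repeat more word types, a raw TTR comparison across texts of different lengths is not meaningful, which is why the sampling step above works with fixed-size subsamples.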


Journal ArticleDOI
TL;DR: This article examined the contribution of knowledge of syntax and knowledge of vocabulary to L2 reading in two pilot studies in different contexts and found support for the relative superiority of syntactic knowledge over vocabulary knowledge in predicting performance on a text reading comprehension test.
Abstract: In the componential approach to modelling reading ability, a number of contributory factors have been empirically validated. However, research on their relative contribution to explaining performance on second language reading tests is limited. Furthermore, the contribution of knowledge of syntax has been largely ignored in comparison with the attention focused on vocabulary. This study examines the relative contribution of knowledge of syntax and knowledge of vocabulary to L2 reading in two pilot studies in different contexts ‐ a heterogeneous population studying at the tertiary level in the UK and a homogenous undergraduate group in Japan ‐ followed by a larger main study, again involving a homogeneous Japanese undergraduate population. In contrast with previous findings in the literature, all three studies offer support for the relative superiority of syntactic knowledge over vocabulary knowledge in predicting performance on a text reading comprehension test. A case is made for the robustness of structural equation modelling compared to conventional regression in accounting for the differential reliabilities of scores on the measures employed.

222 citations
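
The closing claim, that structural equation modelling handles the differential reliabilities of the scores better than conventional regression, rests on a standard result: measurement error in a predictor attenuates its regression weight, and it does so unevenly when the predictors differ in reliability. The short simulation below uses entirely synthetic data and invented reliabilities, not the authors' measures, to show the effect that latent-variable models are designed to correct.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    syntax_true = rng.normal(size=n)
    vocab_true = 0.6 * syntax_true + rng.normal(scale=0.8, size=n)
    reading = 0.5 * syntax_true + 0.3 * vocab_true + rng.normal(scale=0.7, size=n)

    # observed scores = true scores + measurement error; the noisier measure
    # (here the hypothetical syntax test) has its weight attenuated more
    syntax_obs = syntax_true + rng.normal(scale=0.9, size=n)   # lower reliability
    vocab_obs = vocab_true + rng.normal(scale=0.3, size=n)     # higher reliability

    def weights(x1, x2, y):
        # ordinary least-squares weights for y regressed on [1, x1, x2]
        X = np.column_stack([np.ones(n), x1, x2])
        return np.linalg.lstsq(X, y, rcond=None)[0][1:]

    print(weights(syntax_true, vocab_true, reading).round(2))   # close to (0.5, 0.3)
    print(weights(syntax_obs, vocab_obs, reading).round(2))     # syntax weight shrinks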


Journal ArticleDOI
TL;DR: The authors described the reading and test-taking strategies that test takers used on the 'Reading' section of the LanguEdge Courseware (2002) materials developed to familiarize prospective responde...
Abstract: This study describes the reading and test-taking strategies that test takers used on the `Reading' section of the LanguEdge Courseware (2002) materials developed to familiarize prospective responde...

154 citations


Journal ArticleDOI
TL;DR: The authors compared a series of still images to video in academic computer-based tests to determine how test takers engage with these two test modes, and found that test takers engage differently with the two modes of delivery.
Abstract: Over the past decade, listening comprehension tests have been converting to computer-based tests that include visual input. However, little research is available to suggest how test takers engage with different types of visuals on such tests. The present study compared a series of still images to video in academic computer-based tests to determine how test takers engage with these two test modes. The study, which employed observations, retrospective reports and interviews, used data from university-level non-native speakers of English. The findings suggest that test takers engage differently with these two modes of delivery. Specifically, while test takers engaged minimally and similarly with the still images, there was wide variation in the ways and degree to which they engaged with the video stimulus. Implications of the study are that computer-based tests of listening comprehension could include still images while only minimally altering the construct that is measured by audio-only listening tests, but ...

136 citations


Journal ArticleDOI
TL;DR: In this article, the authors explored the effect of online rater self-training in relation to an analytically-scored academic writing task designed to diagnose undergraduates' English learning needs.
Abstract: The use of online rater self-training is growing in popularity and has obvious practical benefits, facilitating access to training materials and rating samples and allowing raters to reorient themselves to the rating scale and self-monitor their behaviour at their own convenience. However, there has thus far been little research into rater attitudes to training via this modality and its effectiveness in enhancing levels of inter- and intra-rater agreement. The current study explores these issues in relation to an analytically-scored academic writing task designed to diagnose undergraduates’ English learning needs. Eight ESL raters scored a number of pre-rated benchmark writing samples online and received immediate feedback in the form of a discrepancy score indicating the gap between their own rating of the various categories of the rating scale and the official ratings assigned to the benchmark writing samples. A batch of writing samples was rated twice (before and after participating in the online training) by each rater, and multifaceted Rasch analyses were used to compare levels of rater agreement and rater bias (on each analytic rating category). Raters’ views regarding the effectiveness of the training were also canvassed. While findings revealed limited overall gains in reliability, there was considerable individual variation in receptiveness to the training input. The paper concludes with suggestions for refining the online training program and for further research into factors influencing rater responsiveness.

110 citations
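
As a concrete illustration of the feedback mechanism the abstract describes, the sketch below computes a per-category discrepancy between a trainee rater's analytic scores and the official benchmark ratings. The category names and score values are hypothetical placeholders, not the study's rating scale.

    # official benchmark ratings for one writing sample (hypothetical categories and values)
    official = {"fluency": 5, "content": 4, "form": 4, "appropriacy": 3}

    def discrepancy_feedback(rater_scores, benchmark=official):
        # positive = rater scored above the benchmark (more lenient),
        # negative = below the benchmark (harsher)
        return {cat: rater_scores[cat] - benchmark[cat] for cat in benchmark}

    print(discrepancy_feedback({"fluency": 4, "content": 4, "form": 5, "appropriacy": 3}))
    # -> {'fluency': -1, 'content': 0, 'form': 1, 'appropriacy': 0}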


Journal ArticleDOI
TL;DR: In this article, the authors examine the gap between EFL teaching and testing in China, where pragmatic proficiency has been incorporated into the teaching and testing syllabi but the corresponding tests still focus on linguistic competence.
Abstract: Pragmatic proficiency has been incorporated in the EFL teaching and testing syllabi in China, but the corresponding tests still focus on linguistic competence. The gap between the teaching and test...

82 citations


Journal ArticleDOI
TL;DR: In this article, confirmatory factor analysis (CFA) and multivariate generalizability theory (G theory) were combined to analyze the responses of 214 admits to a study-abroad program to two role-play speaking tasks in a Spanish speaking assessment designed for student placement and diagnosis.
Abstract: This is a construct validation study of a second language speaking assessment that reported a language profile based on analytic rating scales and a composite score. The study addressed three key issues: score dependability, convergent/discriminant validity of analytic rating scales and the weighting of analytic ratings in the composite score. Confirmatory factor analysis (CFA) and multivariate generalizability theory (G theory) were combined to analyze the responses of 214 admits to a study-abroad program to two role-play speaking tasks in a Spanish speaking assessment designed for student placement and diagnosis. The CFA and G theory approaches provided complementary information, which generally confirmed the key features of the assessment design: (1) the multicomponential and yet highly correlated nature of the five analytic rating scales: Pronunciation, Vocabulary, Cohesion, Organization and Grammar, (2) the high dependability of the ratings and the resulting placement decisions appropriate for the hi...

79 citations
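
Two of the issues the abstract names, composite scoring and score dependability, can be made concrete with a short sketch: a composite formed as a weighted sum of the five analytic scales, and the general form of a G-theory dependability (phi) coefficient for scores averaged over several conditions. The weights and variance components below are hypothetical placeholders, not estimates from the study.

    import numpy as np

    scales = ["Pronunciation", "Vocabulary", "Cohesion", "Organization", "Grammar"]
    weights = np.array([0.2, 0.2, 0.2, 0.2, 0.2])      # hypothetical equal weighting

    def composite(analytic_scores):
        # analytic_scores: five ratings in the order given by `scales`
        return float(np.dot(weights, analytic_scores))

    def phi(person_var, abs_error_var, n_conditions):
        # univariate G-theory dependability for scores averaged over n_conditions
        return person_var / (person_var + abs_error_var / n_conditions)

    print(round(composite([4, 3, 4, 3, 4]), 2))    # -> 3.6
    print(round(phi(0.50, 0.30, 2), 2))            # -> 0.77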


Journal ArticleDOI
F. Scott Walters
TL;DR: This paper focused on the behavior of two CA-trained raters applying a holistic rubric to responses on a test of ESL oral pragmatic competence, and found that post-rating hermeneutic dialogues between the raters provided evidence that valid inferences of examinee oral ESL pragmatic ability could be made through iterative rater recourse to empirical data in a conversation-analytic mode.
Abstract: Speech act theory-based, second language pragmatics testing (SLPT) poses problems for validation due to a lack of correspondence with empirical conversational data. Since conversation analysis (CA) provides a richer and more accurate account of language behavior, it may be preferred as a basis for SLPT development. However, applying CA methodology in turn poses epistemological and practical challenges to psychometrics-driven language testing. The present pilot study, attempting to resolve this seeming conflict, focuses on the behavior of two CA-trained raters applying a holistic rubric to responses on a test of ESL oral pragmatic competence. Results showed that CA-informed testing (CAIT) could be practical, though statistical reliability was not achieved. However, post-rating hermeneutic dialogues between the raters provided evidence that valid inferences of examinee oral ESL pragmatic ability could be made through iterative, rater recourse to empirical data in a conversation-analytic mode. A model for it...

67 citations


Journal ArticleDOI
TL;DR: In this article, the authors describe a practical application of the Roussos and Stout (1996) multidimensional analysis framework for interpreting group performance differences on an ESL reading proficiency test.
Abstract: In this article, I describe a practical application of the Roussos and Stout (1996) multidimensional analysis framework for interpreting group performance differences on an ESL reading proficiency test. Although a variety of statistical methods have been developed for flagging test items that function differentially for equal ability examinees from different ethnic, linguistic, or gender groups, the standard differential item functioning (DIF) detection and review procedures have not been very useful in explaining why DIF occurs in the flagged items (Standards for Educational and Psychological Testing 1999). To address this problem, Douglas, Roussos and Stout (1996) developed a confirmatory approach to DIF, which is used to test DIF hypotheses that are generated from theory and substantive item analyses. In the study described in this paper, DIF and differential bundle functioning (DBF) analyses were conducted to determine whether groups of reading test items, classified according to a bottom-up, top-down reading strategy framework, functioned differentially for equal ability Arabic and Mandarin ESL learners. SIBTEST (Stout and Roussos, 1999) analyses revealed significant systematic group differences in two of the bottom-up and two of the top-down reading strategy categories. These results demonstrate the utility of employing a theoretical framework for interpreting group differences on a reading test.

58 citations
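
The logic behind the differential bundle functioning (DBF) analysis can be shown with a much-simplified sketch: match examinees on their score on the rest of the test, then compare mean bundle scores for the two groups within each matched stratum. This is only the intuition, not SIBTEST itself, which adds a regression correction and weighting; the function and variable names are hypothetical.

    from collections import defaultdict

    def bundle_dbf(responses, bundle, groups):
        # responses: list of dicts mapping item id -> 0/1; groups: 'ref' or 'focal' per examinee
        strata = defaultdict(lambda: {"ref": [], "focal": []})
        for resp, g in zip(responses, groups):
            matching_score = sum(v for k, v in resp.items() if k not in bundle)
            strata[matching_score][g].append(sum(resp[k] for k in bundle))
        diffs = [sum(s["ref"]) / len(s["ref"]) - sum(s["focal"]) / len(s["focal"])
                 for s in strata.values() if s["ref"] and s["focal"]]
        # > 0: the bundle favours the reference group at matched ability; < 0: the focal group
        return sum(diffs) / len(diffs) if diffs else 0.0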


Journal ArticleDOI
TL;DR: This paper examined the validity and topic generality of a writing performance test designed to place international students into appropriate ESL courses at a large midwestern university and found that students' majors were not related to their writing performance.
Abstract: The goal of the current study was to examine the validity and topic generality of a writing performance test designed to place international students into appropriate ESL courses at a large mid-western university. Because for each test administration the test randomly rotates three academic topics integrated with listening and reading sources, it is necessary to investigate the extent to which the three topics are compatible in terms of difficulty and generality across a diverse group of examinees. ESL Placement Test (EPT) scores from more than 1,000 examinees were modeled using multinomial logistic regression. Possible explanatory variables were identified as the assigned writing topic, students' majors, and their scores on the Test of English as a Foreign Language (TOEFL). Results indicate that after controlling for general English proficiency as measured by the TOEFL, students' majors were not related to their writing performance; however, the different topics did affect performance. In light of test v...

54 citations
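
The modelling step named in the abstract, multinomial logistic regression of placement outcome on writing topic, academic major and TOEFL score, can be sketched as follows. The toy data frame, column names and score values are hypothetical; only the overall structure of the analysis is intended.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # hypothetical toy data: one row per examinee
    df = pd.DataFrame({
        "topic":     ["A", "B", "C", "A", "B", "C", "A", "B"],
        "major":     ["Eng", "Bus", "Sci", "Sci", "Bus", "Eng", "Bus", "Sci"],
        "toefl":     [560, 610, 580, 600, 575, 590, 565, 605],
        "placement": [1, 3, 2, 3, 2, 2, 1, 3],      # ESL Placement Test level
    })

    X = pd.get_dummies(df[["topic", "major"]]).join(df["toefl"])
    model = LogisticRegression(max_iter=1000)       # fits a multinomial model for >2 classes
    model.fit(X, df["placement"])

    new = pd.DataFrame({"topic": ["A"], "major": ["Sci"], "toefl": [585]})
    print(model.predict(pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)))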


Journal ArticleDOI
TL;DR: This paper explored the utility of analytic scoring for the TOEFL Academic Speaking Test (TAST) in providing useful and reliable diagnostic information for operational use in three aspects of candidates' performance: delivery, language use and topic development.
Abstract: This study explores the utility of analytic scoring for TAST in providing useful and reliable diagnostic information for operational use in three aspects of candidates' performance: delivery, language use and topic development. One hundred and forty examinees' responses to six TAST tasks were scored analytically on these three aspects of speech. G studies were used to investigate the dependability of the analytic scores, the distinctness of the analytic dimensions, and the variability of analytic score profiles. Raters' perceptions of dimension separability were obtained using a questionnaire. It was found that the dependability of analytic scores averaged across six tasks and double ratings was acceptable for both operational and practice settings. However, scores averaged across two tasks and double ratings were not reliable enough for operational use. Correlations among the analytic scores by task were high but those between delivery and topic development were lower. These results were corroborated by ...
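
The study's central dependability finding, acceptable for scores averaged over six tasks and double ratings but not over two tasks, reflects how absolute error variance shrinks as more tasks and ratings are averaged. The simplified D-study-style sketch below uses hypothetical variance components, not the study's estimates, to express that logic.

    def dependability(person_var, task_var, rater_var, residual_var, n_tasks, n_raters):
        # phi-type index: error attributable to tasks, raters and residual shrinks
        # as more tasks and ratings are averaged into the reported score
        abs_error = (task_var / n_tasks + rater_var / n_raters
                     + residual_var / (n_tasks * n_raters))
        return person_var / (person_var + abs_error)

    for n_tasks in (2, 6):
        print(n_tasks, "tasks:",
              round(dependability(0.40, 0.08, 0.03, 0.30, n_tasks, n_raters=2), 2))
    # -> 2 tasks: 0.75    6 tasks: 0.88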

Journal ArticleDOI
TL;DR: This paper examined the 2002 and 2003 OSSLT test performances of ESL/ELD and non-ESL/ELD students in order to identify and understand the factors that may help explain why ESL/ELD students failed the test at relatively high rates.
Abstract: Results from the Ontario Secondary School Literacy Test (OSSLT) indicate that English as a Second Language (ESL) and English Literacy Development (ELD) students have comparatively low success and high deferral rates. This study examined the 2002 and 2003 OSSLT test performances of ESL/ELD and non-ESL/ELD students in order to identify and understand the factors that may help explain why ESL/ELD students failed the test at relatively high rates. The analyses also attempted to determine if there were significant and systematic differences in ESL/ELD students' test performance. The performance of ESL/ELD students was consistently and similarly lower across item formats, reading text types, skills and strategies, and the four writing tasks. Using discriminant analyses, it was found that narrative text type, indirect understanding skill, vocabulary strategy of reading, and the news report writing task were significant predictors of ESL/ELD membership. The results of this study provide direction for further research and instruction regarding English literacy achievement for these second language students within the context of having to complete large-scale English literacy tests designed and constructed for students whose first language is English.
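
To make the discriminant-analysis step concrete, the sketch below fits a linear discriminant function predicting ESL/ELD membership from per-student subscores on the kinds of predictors the abstract identifies (narrative text type, indirect understanding, vocabulary strategy, news report writing). The data are synthetic and the group difference is invented; only the analytic form is meant to match.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    n = 200
    # synthetic subscores: narrative, indirect understanding, vocabulary strategy, news report
    X = rng.normal(loc=[0.70, 0.65, 0.60, 0.68], scale=0.12, size=(n, 4))
    y = rng.integers(0, 2, size=n)          # 1 = ESL/ELD, 0 = non-ESL/ELD
    X[y == 1] -= 0.08                       # invented group difference on all subscores

    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.coef_.round(2))               # relative weight of each predictor
    print(round(lda.score(X, y), 2))        # classification accuracy (same-sample)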

Journal ArticleDOI
Lorena Llosa
TL;DR: The authors examined the extent to which the English Language Development (ELD) Classroom Assessment measures the same constructs as the CELDT (California English Language Development Test), the state standardized test for English learners.
Abstract: The use of standards-based classroom assessments to test English learners' language proficiency is increasingly prevalent in the United States and many other countries. In a large urban school district in California, for example, a classroom assessment is used to make high-stakes decisions about English learners' progress from one level to the next, and as one of the criteria for reclassifying students as Fluent English Proficient. Yet many researchers have questioned the validity of using classroom assessments for making high-stakes decisions about students (Brindley, 1998; 2001; Rea-Dickins and Gardner, 2000). One way to investigate the validity of the inferences drawn from these assessments is to examine them in relation to other measures of the same ability. In this study, a multivariate analytic approach was used to examine the extent to which the English Language Development (ELD) Classroom Assessment measures the same constructs as the CELDT (California English Language Development Test), the state...

Journal ArticleDOI
Guoxing Yu
TL;DR: In this article, two kinds of scoring templates were empirically derived from summaries written by experts and students to evaluate the quality of summaries written by the students, and students' att...
Abstract: Two kinds of scoring templates were empirically derived from summaries written by experts and students to evaluate the quality of summaries written by the students. This paper reports students' att...

Journal ArticleDOI
TL;DR: In this article, the authors address the persistent problem in educational and psychological measurement of providing information to test takers and test score users about the abilities of test takers at different score levels.
Abstract: Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Ca...

Journal ArticleDOI
TL;DR: The question of whether test takers should be allowed access to dictionaries when taking L2 tests has been the subject of debate for a good number of years, and opinions differ according to how the test construct is used.
Abstract: Whether test takers should be allowed access to dictionaries when taking L2 tests has been the subject of debate for a good number of years. Opinions differ according to how the test construct is u...

Journal ArticleDOI
TL;DR: This article evaluated multitrak items in relation to three better-known MC item types; the multitrak items were shown to provide more information about candidates in the ability range represented by Levels B2 and C1 on the Common European Framework of Reference (CEFR), as well as allowing a focus on more difficult content than the other item types in the study.
Abstract: Some educational contexts almost mandate the application of multiple-choice (MC) testing techniques, even if they are deplored by many practitioners in the field. In such contexts especially, research into how well these types of item perform and how their performance may be characterised is both appropriate and desirable. The focus of this paper is on a modified type of MC item dubbed “multitrak”, which was used to test grammar in the Hungarian national admissions test for English between 1993 and 2004. Multitrak items were evaluated in relation to three better-known MC item types and were shown to provide more information about candidates in the ability range represented by Levels B2 and C1 on the Common European Framework of Reference (CEFR), as well as allowing a focus on more difficult content than the other item types in the study.

I Introduction

Performance in the context of a language test is similar to communicative performance in that the construct measured is coloured by various features of that performance, including those of the test method. Since no test method is monolithic, even in an objective selected-response item format we may find a variety of test method facets (such as the number of response options provided for an MC item or the number and position of blanks in the stem) which may constrain or enhance the measurement of the construct and therefore warrant close investigation. The item format may limit or prevent certain construct elements from being included in the test, or otherwise interfere with it, causing distortions in the scores, with the possible result that they no longer reflect the construct very well. Alternatively,
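
The claim that multitrak items provide more information about candidates in a particular ability range is an item response theory statement. The sketch below shows the standard two-parameter logistic item information function on which such comparisons are typically based; the parameter values are hypothetical, and the study's actual IRT model and estimates are not reproduced here.

    import math

    def p_2pl(theta, a, b):
        # probability of a correct response under the two-parameter logistic model
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def item_information(theta, a, b):
        # 2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta))
        p = p_2pl(theta, a, b)
        return a ** 2 * p * (1 - p)

    # a discriminating item targeted at higher ability (hypothetical a and b values)
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(item_information(theta, a=1.4, b=1.0), 3))

Items with higher discrimination concentrate their information near their difficulty, which is why one item type can be more informative in the B2-C1 range while contributing little elsewhere.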


Journal ArticleDOI
TL;DR: Bachman and Kunnan (2005) is the workbook for Bachman (2004); it provides carefully planned reinforcement through a variety of practical examples and hands-on exercises, and also includes a CD that is loaded with useful data and other sorts of files.
Abstract: Bachman and Kunnan (2005) is the workbook for Bachman (2004), which will hereafter be referred to as the main book (see separate review in this volume). Any workbook is defined in large part by its relationship to the main book, and this workbook is no exception. This workbook does not stand alone, depending as it does on the explanations in the main book. Yet it will clearly help any students who are serious about learning the material in the main book because this workbook provides carefully planned reinforcement in a variety of practical examples and hands-on exercises. The workbook also includes a CD that is loaded with useful data and other sorts of files. Given that the workbook and CD are complex and highly interrelated, I will begin by discussing the workbook and then turn to the CD, after which I will consider how well the two fit together as a package with the main book.



Journal ArticleDOI
TL;DR: The authors point out that the actual practices of writing assessment occur through raters' individual (or sometimes collaborative) interpretations and judgments as they score written texts, and that these processes are what research has recently begun to describe, accounting for the full, interpretive dimensions of composition assessment, above and beyond scoring schemes and composition tasks.
Abstract: Research has, in recent years, greatly advanced our understanding of the knowledge, decisions, and thinking that experienced raters use to score assessments of writing. For centuries, holistic, analytic, or impressionistic methods of rating written compositions have featured in tests and examinations, particularly in academic settings (Cumming, 1997; Spolsky, 1995). Descriptive criteria, rating scales, and benchmark or examplar papers have been the primary means of specifying the content of such assessments, implemented through the training, monitoring, and moderating of raters to ensure their reliability in scoring (Ruth and Murphy, 1988; Weigle, 2002). But such materials are only the tools that guide writing assessments. Overemphasizing their formal elements can trivialize the value and complexity of writing and of students’ abilities, as Purves (1992) bemoaned. Moreover, institutionalizing the status of scales for writing performance can lead to a stultifying circularity about the nature of educational achievement – whereby the scales themselves circumscribe and limit, rather than empirically describe or expand, the parameters of learning, curricula, and human abilities (Brindley, 1998; Cumming, 2001; Lantolf and Frawley, 1985). The actual practices of writing assessments occur through raters’ individual (or sometimes collaborative) interpretations and judgments while they score written texts. These processes are what research has recently begun to describe, accounting for the full, interpretive dimensions of composition assessment, above and beyond scoring schemes and composition tasks. As Connor-Linton (1995) pointedly urged, we have to look ‘behind the curtain’, into ‘raters’ minds’, to examine what composition assessments really involve. Over the past