
Showing papers in "Language Testing in 2007"


Journal ArticleDOI
TL;DR: This paper explains why D is affected by text length, and demonstrates with an extensive empirical analysis that the effects of text length are significant over certain ranges, which are identified.
Abstract: A reliable index of lexical diversity (LD) has remained stubbornly elusive for over 60 years. Meanwhile, researchers in fields as varied as stylistics, neuropathology, language acquisition, and even forensics continue to use flawed LD indices — often ignorant that their results are questionable and in some cases potentially dangerous. Recently, an LD measurement instrument known as vocd has become the virtual tool of the LD trade. In this paper, we report both theoretical and empirical evidence that calls into question the rationale for vocd and also indicates that its reliability is not optimal. Although our evidence shows that vocd's output (D) is a relatively robust indicator of the aggregate probabilities of word occurrences in a text, we show that these probabilities — and thus also D — are affected by text length. Malvern, Richards, Chipere and Duran (2004) acknowledge that D (as calculated by vocd's default method) can be affected by text length, but claim that the effects are not significant for t...

238 citations
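
The abstract turns on two technical points: type-token ratio (TTR) falls as texts get longer, and vocd estimates D by fitting a curve to mean TTRs of random subsamples. The Python sketch below illustrates that general sampling-and-curve-fitting idea only; the subsample sizes, trial count, search grid and the curve TTR(N) = (D/N)(sqrt(1 + 2N/D) - 1) follow the approach attributed to Malvern et al. (2004), but this is not the vocd implementation and its defaults may differ.

    import math
    import random

    def mean_ttr(tokens, n, trials=100):
        # average type-token ratio over `trials` random n-token subsamples
        return sum(len(set(random.sample(tokens, n))) / n for _ in range(trials)) / trials

    def estimate_d(tokens, sizes=range(35, 51)):
        # fit D so that (D/N) * (sqrt(1 + 2N/D) - 1) tracks the observed mean TTRs
        observed = [(n, mean_ttr(tokens, n)) for n in sizes]
        best_d, best_err = None, float("inf")
        for d in (x / 10 for x in range(10, 2001)):        # candidate D from 1.0 to 200.0
            err = sum((ttr - (d / n) * (math.sqrt(1 + 2 * n / d) - 1)) ** 2
                      for n, ttr in observed)
            if err < best_err:
                best_d, best_err = d, err
        return best_d

Because longer texts inevitably repeat more word types, a raw TTR comparison across texts of different lengths is not meaningful, which is why the sampling step above works with fixed-size subsamples.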


Journal ArticleDOI
TL;DR: This article examined the contribution of knowledge of syntax and knowledge of vocabulary to L2 reading in two pilot studies in different contexts and found support for the relative superiority of syntactic knowledge over vocabulary knowledge in predicting performance on a text reading comprehension test.
Abstract: In the componential approach to modelling reading ability, a number of contributory factors have been empirically validated. However, research on their relative contribution to explaining performance on second language reading tests is limited. Furthermore, the contribution of knowledge of syntax has been largely ignored in comparison with the attention focused on vocabulary. This study examines the relative contribution of knowledge of syntax and knowledge of vocabulary to L2 reading in two pilot studies in different contexts ‐ a heterogeneous population studying at the tertiary level in the UK and a homogenous undergraduate group in Japan ‐ followed by a larger main study, again involving a homogeneous Japanese undergraduate population. In contrast with previous findings in the literature, all three studies offer support for the relative superiority of syntactic knowledge over vocabulary knowledge in predicting performance on a text reading comprehension test. A case is made for the robustness of structural equation modelling compared to conventional regression in accounting for the differential reliabilities of scores on the measures employed.

222 citations
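
The closing claim, that structural equation modelling handles the differential reliabilities of the scores better than conventional regression, rests on a standard result: measurement error in a predictor attenuates its regression weight, and it does so unevenly when the predictors differ in reliability. The short simulation below uses entirely synthetic data and invented reliabilities, not the authors' measures, to show the effect that latent-variable models are designed to correct.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    syntax_true = rng.normal(size=n)
    vocab_true = 0.6 * syntax_true + rng.normal(scale=0.8, size=n)
    reading = 0.5 * syntax_true + 0.3 * vocab_true + rng.normal(scale=0.7, size=n)

    # observed scores = true scores + measurement error; the noisier measure
    # (here the hypothetical syntax test) has its weight attenuated more
    syntax_obs = syntax_true + rng.normal(scale=0.9, size=n)   # lower reliability
    vocab_obs = vocab_true + rng.normal(scale=0.3, size=n)     # higher reliability

    def weights(x1, x2, y):
        # ordinary least-squares weights for y regressed on [1, x1, x2]
        X = np.column_stack([np.ones(n), x1, x2])
        return np.linalg.lstsq(X, y, rcond=None)[0][1:]

    print(weights(syntax_true, vocab_true, reading).round(2))   # close to (0.5, 0.3)
    print(weights(syntax_obs, vocab_obs, reading).round(2))     # syntax weight shrinks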


Journal ArticleDOI
TL;DR: The authors described the reading and test-taking strategies that test takers used on the 'Reading' section of the LanguEdge Courseware (2002) materials developed to familiarize prospective responde...
Abstract: This study describes the reading and test-taking strategies that test takers used on the `Reading' section of the LanguEdge Courseware (2002) materials developed to familiarize prospective responde...

154 citations


Journal ArticleDOI
TL;DR: The authors compared a series of still images to video in academic computer-based tests to determine how test takers engage with these two test modes, and found that test takers engage differently with the two modes of delivery.
Abstract: Over the past decade, listening comprehension tests have been converting to computer-based tests that include visual input. However, little research is available to suggest how test takers engage with different types of visuals on such tests. The present study compared a series of still images to video in academic computer-based tests to determine how test takers engage with these two test modes. The study, which employed observations, retrospective reports and interviews, used data from university-level non-native speakers of English. The findings suggest that test takers engage differently with these two modes of delivery. Specifically, while test takers engaged minimally and similarly with the still images, there was wide variation in the ways and degree to which they engaged with the video stimulus. Implications of the study are that computer-based tests of listening comprehension could include still images while only minimally altering the construct that is measured by audio-only listening tests, but ...

136 citations


Journal ArticleDOI
TL;DR: In this article, the authors explored the effect of online rater self-training in relation to an analytically-scored academic writing task designed to diagnose undergraduates' English learning needs.
Abstract: The use of online rater self-training is growing in popularity and has obvious practical benefits, facilitating access to training materials and rating samples and allowing raters to reorient themselves to the rating scale and self-monitor their behaviour at their own convenience. However, there has thus far been little research into rater attitudes to training via this modality and its effectiveness in enhancing levels of inter- and intra-rater agreement. The current study explores these issues in relation to an analytically-scored academic writing task designed to diagnose undergraduates’ English learning needs. Eight ESL raters scored a number of pre-rated benchmark writing samples online and received immediate feedback in the form of a discrepancy score indicating the gap between their own rating of the various categories of the rating scale and the official ratings assigned to the benchmark writing samples. A batch of writing samples was rated twice (before and after participating in the online training) by each rater, and multifaceted Rasch analyses were used to compare levels of rater agreement and rater bias (on each analytic rating category). Raters’ views regarding the effectiveness of the training were also canvassed. While findings revealed limited overall gains in reliability, there was considerable individual variation in receptiveness to the training input. The paper concludes with suggestions for refining the online training program and for further research into factors influencing rater responsiveness.

110 citations
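
As a concrete illustration of the feedback mechanism the abstract describes, the sketch below computes a per-category discrepancy between a trainee rater's analytic scores and the official benchmark ratings. The category names and score values are hypothetical placeholders, not the study's rating scale.

    # official benchmark ratings for one writing sample (hypothetical categories and values)
    official = {"fluency": 5, "content": 4, "form": 4, "appropriacy": 3}

    def discrepancy_feedback(rater_scores, benchmark=official):
        # positive = rater scored above the benchmark (more lenient),
        # negative = below the benchmark (harsher)
        return {cat: rater_scores[cat] - benchmark[cat] for cat in benchmark}

    print(discrepancy_feedback({"fluency": 4, "content": 4, "form": 5, "appropriacy": 3}))
    # -> {'fluency': -1, 'content': 0, 'form': 1, 'appropriacy': 0}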


Journal ArticleDOI
TL;DR: In this article, the authors examine the gap between EFL teaching and testing in China, where pragmatic proficiency has been incorporated into the teaching and testing syllabi but the corresponding tests still focus on linguistic competence.
Abstract: Pragmatic proficiency has been incorporated in the EFL teaching and testing syllabi in China, but the corresponding tests still focus on linguistic competence. The gap between the teaching and test...

82 citations


Journal ArticleDOI
TL;DR: In this article, confirmatory factor analysis (CFA) and multivariate generalizability theory (G theory) were combined to analyze the responses of 214 admits to a study-abroad program to two role-play speaking tasks in a Spanish speaking assessment designed for student placement and diagnosis.
Abstract: This is a construct validation study of a second language speaking assessment that reported a language profile based on analytic rating scales and a composite score. The study addressed three key issues: score dependability, convergent/discriminant validity of analytic rating scales and the weighting of analytic ratings in the composite score. Confirmatory factor analysis (CFA) and multivariate generalizability theory (G theory) were combined to analyze the responses of 214 admits to a study-abroad program to two role-play speaking tasks in a Spanish speaking assessment designed for student placement and diagnosis. The CFA and G theory approaches provided complementary information, which generally confirmed the key features of the assessment design: (1) the multicomponential and yet highly correlated nature of the five analytic rating scales: Pronunciation, Vocabulary, Cohesion, Organization and Grammar, (2) the high dependability of the ratings and the resulting placement decisions appropriate for the hi...

79 citations
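
Two of the issues the abstract names, composite scoring and score dependability, can be made concrete with a short sketch: a composite formed as a weighted sum of the five analytic scales, and the general form of a G-theory dependability (phi) coefficient for scores averaged over several conditions. The weights and variance components below are hypothetical placeholders, not estimates from the study.

    import numpy as np

    scales = ["Pronunciation", "Vocabulary", "Cohesion", "Organization", "Grammar"]
    weights = np.array([0.2, 0.2, 0.2, 0.2, 0.2])      # hypothetical equal weighting

    def composite(analytic_scores):
        # analytic_scores: five ratings in the order given by `scales`
        return float(np.dot(weights, analytic_scores))

    def phi(person_var, abs_error_var, n_conditions):
        # univariate G-theory dependability for scores averaged over n_conditions
        return person_var / (person_var + abs_error_var / n_conditions)

    print(round(composite([4, 3, 4, 3, 4]), 2))    # -> 3.6
    print(round(phi(0.50, 0.30, 2), 2))            # -> 0.77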


Journal ArticleDOI
F. Scott Walters
TL;DR: This paper focused on the behavior of two CA-trained raters applying a holistic rubric to responses on a test of ESL oral pragmatic competence, and found that post-rating hermeneutic dialogues between the raters provided evidence that valid inferences of examinee oral ESL pragmatic ability could be made through iterative rater recourse to empirical data in a conversation-analytic mode.
Abstract: Speech act theory-based, second language pragmatics testing (SLPT) poses problems for validation due to a lack of correspondence with empirical conversational data. Since conversation analysis (CA) provides a richer and more accurate account of language behavior, it may be preferred as a basis for SLPT development. However, applying CA methodology in turn poses epistemological and practical challenges to psychometrics-driven language testing. The present pilot study, attempting to resolve this seeming conflict, focuses on the behavior of two CA-trained raters applying a holistic rubric to responses on a test of ESL oral pragmatic competence. Results showed that CA-informed testing (CAIT) could be practical, though statistical reliability was not achieved. However, post-rating hermeneutic dialogues between the raters provided evidence that valid inferences of examinee oral ESL pragmatic ability could be made through iterative, rater recourse to empirical data in a conversation-analytic mode. A model for it...

67 citations


Journal ArticleDOI
TL;DR: In this article, the authors describe a practical application of the Roussos and Stout (1996) multidimensional analysis framework for interpreting group performance differences on an ESL reading proficiency test.
Abstract: In this article, I describe a practical application of the Roussos and Stout (1996) multidimensional analysis framework for interpreting group performance differences on an ESL reading proficiency test. Although a variety of statistical methods have been developed for flagging test items that function differentially for equal ability examinees from different ethnic, linguistic, or gender groups, the standard differential item functioning (DIF) detection and review procedures have not been very useful in explaining why DIF occurs in the flagged items (Standards for Educational and Psychological Testing 1999). To address this problem, Douglas, Roussos and Stout (1996) developed a confirmatory approach to DIF, which is used to test DIF hypotheses that are generated from theory and substantive item analyses. In the study described in this paper, DIF and differential bundle functioning (DBF) analyses were conducted to determine whether groups of reading test items, classified according to a bottom-up, top-down reading strategy framework, functioned differentially for equal ability Arabic and Mandarin ESL learners. SIBTEST (Stout and Roussos, 1999) analyses revealed significant systematic group differences in two of the bottom-up and two of the top-down reading strategy categories. These results demonstrate the utility of employing a theoretical framework for interpreting group differences on a reading test.

58 citations
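
The logic behind the differential bundle functioning (DBF) analysis can be shown with a much-simplified sketch: match examinees on their score on the rest of the test, then compare mean bundle scores for the two groups within each matched stratum. This is only the intuition, not SIBTEST itself, which adds a regression correction and weighting; the function and variable names are hypothetical.

    from collections import defaultdict

    def bundle_dbf(responses, bundle, groups):
        # responses: list of dicts mapping item id -> 0/1; groups: 'ref' or 'focal' per examinee
        strata = defaultdict(lambda: {"ref": [], "focal": []})
        for resp, g in zip(responses, groups):
            matching_score = sum(v for k, v in resp.items() if k not in bundle)
            strata[matching_score][g].append(sum(resp[k] for k in bundle))
        diffs = [sum(s["ref"]) / len(s["ref"]) - sum(s["focal"]) / len(s["focal"])
                 for s in strata.values() if s["ref"] and s["focal"]]
        # > 0: the bundle favours the reference group at matched ability; < 0: the focal group
        return sum(diffs) / len(diffs) if diffs else 0.0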


Journal ArticleDOI
TL;DR: This paper examined the validity and topic generality of a writing performance test designed to place international students into appropriate ESL courses at a large midwestern university and found that students' majors were not related to their writing performance.
Abstract: The goal of the current study was to examine the validity and topic generality of a writing performance test designed to place international students into appropriate ESL courses at a large mid-western university. Because for each test administration the test randomly rotates three academic topics integrated with listening and reading sources, it is necessary to investigate the extent to which the three topics are compatible in terms of difficulty and generality across a diverse group of examinees. ESL Placement Test (EPT) scores from more than 1,000 examinees were modeled using multinomial logistic regression. Possible explanatory variables were identified as the assigned writing topic, students' majors, and their scores on the Test of English as a Foreign Language (TOEFL). Results indicate that after controlling for general English proficiency as measured by the TOEFL, students' majors were not related to their writing performance; however, the different topics did affect performance. In light of test v...

54 citations
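
The modelling step named in the abstract, multinomial logistic regression of placement outcome on writing topic, academic major and TOEFL score, can be sketched as follows. The toy data frame, column names and score values are hypothetical; only the overall structure of the analysis is intended.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # hypothetical toy data: one row per examinee
    df = pd.DataFrame({
        "topic":     ["A", "B", "C", "A", "B", "C", "A", "B"],
        "major":     ["Eng", "Bus", "Sci", "Sci", "Bus", "Eng", "Bus", "Sci"],
        "toefl":     [560, 610, 580, 600, 575, 590, 565, 605],
        "placement": [1, 3, 2, 3, 2, 2, 1, 3],      # ESL Placement Test level
    })

    X = pd.get_dummies(df[["topic", "major"]]).join(df["toefl"])
    model = LogisticRegression(max_iter=1000)       # fits a multinomial model for >2 classes
    model.fit(X, df["placement"])

    new = pd.DataFrame({"topic": ["A"], "major": ["Sci"], "toefl": [585]})
    print(model.predict(pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)))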


Journal ArticleDOI
TL;DR: This paper explored the utility of analytic scoring for the TOEFL Academic Speaking Test (TAST) in providing useful and reliable diagnostic information for operational use in three aspects of candidates' performance: delivery, language use and topic development.
Abstract: This study explores the utility of analytic scoring for TAST in providing useful and reliable diagnostic information for operational use in three aspects of candidates' performance: delivery, language use and topic development. One hundred and forty examinees' responses to six TAST tasks were scored analytically on these three aspects of speech. G studies were used to investigate the dependability of the analytic scores, the distinctness of the analytic dimensions, and the variability of analytic score profiles. Raters' perceptions of dimension separability were obtained using a questionnaire. It was found that the dependability of analytic scores averaged across six tasks and double ratings was acceptable for both operational and practice settings. However, scores averaged across two tasks and double ratings were not reliable enough for operational use. Correlations among the analytic scores by task were high but those between delivery and topic development were lower. These results were corroborated by ...
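
The study's central dependability finding, acceptable for scores averaged over six tasks and double ratings but not over two tasks, reflects how absolute error variance shrinks as more tasks and ratings are averaged. The simplified D-study-style sketch below uses hypothetical variance components, not the study's estimates, to express that logic.

    def dependability(person_var, task_var, rater_var, residual_var, n_tasks, n_raters):
        # phi-type index: error attributable to tasks, raters and residual shrinks
        # as more tasks and ratings are averaged into the reported score
        abs_error = (task_var / n_tasks + rater_var / n_raters
                     + residual_var / (n_tasks * n_raters))
        return person_var / (person_var + abs_error)

    for n_tasks in (2, 6):
        print(n_tasks, "tasks:",
              round(dependability(0.40, 0.08, 0.03, 0.30, n_tasks, n_raters=2), 2))
    # -> 2 tasks: 0.75    6 tasks: 0.88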

Journal ArticleDOI
TL;DR: This paper examined the 2002 and 2003 OSSLT test performances of ESL/ELD and non-ESL/ELD students in order to identify and understand the factors that may help explain why ESL/ELD students failed the test at relatively high rates.
Abstract: Results from the Ontario Secondary School Literacy Test (OSSLT) indicate that English as a Second Language (ESL) and English Literacy Development (ELD) students have comparatively low success and high deferral rates. This study examined the 2002 and 2003 OSSLT test performances of ESL/ELD and non-ESL/ELD students in order to identify and understand the factors that may help explain why ESL/ELD students failed the test at relatively high rates. The analyses also attempted to determine if there were significant and systematic differences in ESL/ELD students' test performance. The performance of ESL/ELD students was consistently and similarly lower across item formats, reading text types, skills and strategies, and the four writing tasks. Using discriminant analyses, it was found that narrative text type, indirect understanding skill, vocabulary strategy of reading, and the news report writing task were significant predictors of ESL/ELD membership. The results of this study provide direction for further research and instruction regarding English literacy achievement for these second language students within the context of having to complete large-scale English literacy tests designed and constructed for students whose first language is English.
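
To make the discriminant-analysis step concrete, the sketch below fits a linear discriminant function predicting ESL/ELD membership from per-student subscores on the kinds of predictors the abstract identifies (narrative text type, indirect understanding, vocabulary strategy, news report writing). The data are synthetic and the group difference is invented; only the analytic form is meant to match.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    n = 200
    # synthetic subscores: narrative, indirect understanding, vocabulary strategy, news report
    X = rng.normal(loc=[0.70, 0.65, 0.60, 0.68], scale=0.12, size=(n, 4))
    y = rng.integers(0, 2, size=n)          # 1 = ESL/ELD, 0 = non-ESL/ELD
    X[y == 1] -= 0.08                       # invented group difference on all subscores

    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.coef_.round(2))               # relative weight of each predictor
    print(round(lda.score(X, y), 2))        # classification accuracy (same-sample)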

Journal ArticleDOI
Lorena Llosa
TL;DR: The authors examined the extent to which the English Language Development (ELD) Classroom Assessment measures the same constructs as the CELDT (California English Language Development Test), the state standardized test for English learners.
Abstract: The use of standards-based classroom assessments to test English learners' language proficiency is increasingly prevalent in the United States and many other countries. In a large urban school district in California, for example, a classroom assessment is used to make high-stakes decisions about English learners' progress from one level to the next, and as one of the criteria for reclassifying students as Fluent English Proficient. Yet many researchers have questioned the validity of using classroom assessments for making high-stakes decisions about students (Brindley, 1998; 2001; Rea-Dickins and Gardner, 2000). One way to investigate the validity of the inferences drawn from these assessments is to examine them in relation to other measures of the same ability. In this study, a multivariate analytic approach was used to examine the extent to which the English Language Development (ELD) Classroom Assessment measures the same constructs as the CELDT (California English Language Development Test), the state...

Journal ArticleDOI
Guoxing Yu
TL;DR: In this article, two kinds of scoring templates were empirically derived from summaries written by experts and students to evaluate the quality of summaries written by the students, and students' att...
Abstract: Two kinds of scoring templates were empirically derived from summaries written by experts and students to evaluate the quality of summaries written by the students. This paper reports students' att...

Journal ArticleDOI
TL;DR: In this article, the authors address the persistent problem in educational and psychological measurement of providing information to test takers and test score users about the abilities of test takers at different score levels.
Abstract: Providing information to test takers and test score users about the abilities of test takers at different score levels has been a persistent problem in educational and psychological measurement (Ca...

Journal ArticleDOI
TL;DR: The question of whether test takers should be allowed access to dictionaries when taking L2 tests has been the subject of debate for a good number of years, and opinions differ according to how the test construct is used.
Abstract: Whether test takers should be allowed access to dictionaries when taking L2 tests has been the subject of debate for a good number of years. Opinions differ according to how the test construct is u...

Journal ArticleDOI
TL;DR: This article evaluated multitrak items in relation to three better-known MC item types; the multitrak items were shown to provide more information about candidates in the ability range represented by Levels B2 and C1 on the Common European Framework of Reference (CEFR), as well as allowing a focus on more difficult content than the other item types in the study.
Abstract: Some educational contexts almost mandate the application of multiple-choice (MC) testing techniques, even if they are deplored by many practitioners in the field. In such contexts especially, research into how well these types of item perform and how their performance may be characterised is both appropriate and desirable. The focus of this paper is on a modified type of MC item dubbed “multitrak”, which was used to test grammar in the Hungarian national admissions test for English between 1993 and 2004. Multitrak items were evaluated in relation to three better-known MC item types and were shown to provide more information about candidates in the ability range represented by Levels B2 and C1 on the Common European Framework of Reference (CEFR), as well as allowing a focus on more difficult content than the other item types in the study.

I Introduction

Performance in the context of a language test is similar to communicative performance in that the construct measured is coloured by various features of that performance, including those of the test method. Since no test method is monolithic, even in an objective selected-response item format we may find a variety of test method facets (such as the number of response options provided for an MC item or the number and position of blanks in the stem) which may constrain or enhance the measurement of the construct and therefore warrant close investigation. The item format may limit or prevent certain construct elements from being included in the test, or otherwise interfere with it, causing distortions in the scores, with the possible result that they no longer reflect the construct very well. Alternatively,
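
The claim that multitrak items provide more information about candidates in a particular ability range is an item response theory statement. The sketch below shows the standard two-parameter logistic item information function on which such comparisons are typically based; the parameter values are hypothetical, and the study's actual IRT model and estimates are not reproduced here.

    import math

    def p_2pl(theta, a, b):
        # probability of a correct response under the two-parameter logistic model
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def item_information(theta, a, b):
        # 2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta))
        p = p_2pl(theta, a, b)
        return a ** 2 * p * (1 - p)

    # a discriminating item targeted at higher ability (hypothetical a and b values)
    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(item_information(theta, a=1.4, b=1.0), 3))

Items with higher discrimination concentrate their information near their difficulty, which is why one item type can be more informative in the B2-C1 range while contributing little elsewhere.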


Journal ArticleDOI
TL;DR: Bachman and Kunnan (2005) is the workbook for Bachman (2004); it provides carefully planned reinforcement through a variety of practical examples and hands-on exercises, and also includes a CD that is loaded with useful data and other sorts of files.
Abstract: Bachman and Kunnan (2005) is the workbook for Bachman (2004), which will hereafter be referred to as the main book (see separate review in this volume). Any workbook is defined in large part by its relationship to the main book, and this workbook is no exception. This workbook does not stand alone, depending as it does on the explanations in the main book. Yet it will clearly help any students who are serious about learning the material in the main book because this workbook provides carefully planned reinforcement in a variety of practical examples and hands-on exercises. The workbook also includes a CD that is loaded with useful data and other sorts of files. Given that the workbook and CD are complex and highly interrelated, I will begin by discussing the workbook and then turn to the CD, after which I will consider how well the two fit together as a package with the main book.



Journal ArticleDOI
TL;DR: The authors point out that the actual practices of writing assessment occur through raters' individual (or sometimes collaborative) interpretations and judgments as they score written texts, and that these processes are what research has recently begun to describe, accounting for the full, interpretive dimensions of composition assessment, above and beyond scoring schemes and composition tasks.
Abstract: Research has, in recent years, greatly advanced our understanding of the knowledge, decisions, and thinking that experienced raters use to score assessments of writing. For centuries, holistic, analytic, or impressionistic methods of rating written compositions have featured in tests and examinations, particularly in academic settings (Cumming, 1997; Spolsky, 1995). Descriptive criteria, rating scales, and benchmark or examplar papers have been the primary means of specifying the content of such assessments, implemented through the training, monitoring, and moderating of raters to ensure their reliability in scoring (Ruth and Murphy, 1988; Weigle, 2002). But such materials are only the tools that guide writing assessments. Overemphasizing their formal elements can trivialize the value and complexity of writing and of students’ abilities, as Purves (1992) bemoaned. Moreover, institutionalizing the status of scales for writing performance can lead to a stultifying circularity about the nature of educational achievement – whereby the scales themselves circumscribe and limit, rather than empirically describe or expand, the parameters of learning, curricula, and human abilities (Brindley, 1998; Cumming, 2001; Lantolf and Frawley, 1985). The actual practices of writing assessments occur through raters’ individual (or sometimes collaborative) interpretations and judgments while they score written texts. These processes are what research has recently begun to describe, accounting for the full, interpretive dimensions of composition assessment, above and beyond scoring schemes and composition tasks. As Connor-Linton (1995) pointedly urged, we have to look ‘behind the curtain’, into ‘raters’ minds’, to examine what composition assessments really involve. Over the past