
Showing papers in "Language Testing in 2006"


Journal ArticleDOI
TL;DR: This paper provided renewed converging empirical evidence for the hypothesis that asking test-takers to respond to text passages with multiple-choice questions induces response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts.
Abstract: This article provides renewed converging empirical evidence for the hypothesis that asking test-takers to respond to text passages with multiple-choice questions induces response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts. Moreover, the article shows that the construct of reading comprehension is assessment specific and is fundamentally determined through item design and text selection. The data come from qualitative analyses of 10 cognitive interviews conducted with non-native adult English readers who were given three passages with several multiple-choice questions from the CanTEST, a large-scale language test used for admission and placement purposes in Canada, in a partially counter-balanced design. The analyses show that:
• There exist multiple different representations of the construct of ‘reading comprehension’ that are revealed through the characteristics of the items.
• Learners view responding to multiple-choice questions ...

212 citations


Journal ArticleDOI
TL;DR: In this paper, the authors examined the hypothesis that C-tests measure general language proficiency and showed that language proficiency was divisible into more specific constructs and that examinee proficiency level differentially influenced C-test performance.
Abstract: What C-tests actually measure has been an issue of debate for many years. In the present research, the authors examined the hypothesis that C-tests measure general language proficiency. A total of 843 participants from four independent samples took a German C-test along with the TestDaF (Test of German as a Foreign Language). Rasch measurement modelling and confirmatory factor analysis provided clear evidence that the C-test in question was a highly reliable, unidimensional instrument, which measured the same general dimension as the four TestDaF sections: reading, listening, writing and speaking. Moreover, the authors showed that language proficiency was divisible into more specific constructs and that examinee proficiency level differentially influenced C-test performance. The findings have implications for the multicomponentiality and fluidity of the C-test measurement construct.

176 citations


Journal ArticleDOI
TL;DR: Despite increasing interest in interlanguage pragmatics research, research on assessment of this crucial area of second language competence still lags behind assessment of other aspects of learners as discussed by the authors.
Abstract: Despite increasing interest in interlanguage pragmatics research, research on assessment of this crucial area of second language competence still lags behind assessment of other aspects of learners...

125 citations


Journal ArticleDOI
TL;DR: This paper found that the Yes/No test (Huibregtse et al., 2002) is a valid measure of the type of L2 vocabulary knowledge assessed by the VLT, with implications for classroom application.
Abstract: Performance on the Yes/No test (Huibregtse et al., 2002) was assessed as a predictor of scores on the Vocabulary Levels Test (VLT), a standard test of receptive second language (L2) vocabulary knowledge (Nation, 1990). The use of identical items on both tests allowed a direct comparison of test performance, with alternative methods for scoring the Yes/No test also examined (Huibregtse et al., 2002). Overall, performance on both tests by English L2 university students (n = 36) was similar. Mean test accuracy on the various Yes/No methods ranged from 76-82%, comparable to VLT performance at 83%. However, paired t-tests showed the scoring methods used to correct raw hit performance increased the difference between the Yes/No test and criterion VLT scores to some degree. All Yes/No scores were strong predictors of VLT performance, regardless of method used, r = .8. Raw hit rate was the best predictor of VLT performance, due in part to the <5% false alarm rate. The low false alarm rate may be due to the participants, drawn primarily from non-Latin alphabet first languages (L1s), and the nature of the instructions. The results indicate the Yes/No test is a valid measure of the type of L2 vocabulary knowledge assessed by the VLT, with implications for classroom application.
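The hit/false-alarm logic behind these scoring methods can be sketched as follows. This is a generic illustration using the classic (h − f)/(1 − f) correction-for-guessing adjustment, with hypothetical counts; it is not a reproduction of the specific Huibregtse et al. (2002) indices or of the study's data:

```python
def yes_no_scores(hits, misses, false_alarms, correct_rejections):
    """Score a Yes/No vocabulary test from its four response counts.

    Real words the examinee claims to know are 'hits'; pseudowords
    claimed known are 'false alarms'. The correction-for-guessing
    formula below is the classic (h - f) / (1 - f) adjustment, shown
    for illustration only; it is not necessarily one of the scoring
    methods examined in the study.
    """
    h = hits / (hits + misses)                              # hit rate
    f = false_alarms / (false_alarms + correct_rejections)  # false-alarm rate
    corrected = (h - f) / (1 - f) if f < 1 else 0.0
    return h, corrected

# hypothetical counts: 50 real words, 20 pseudowords
raw, corrected = yes_no_scores(hits=40, misses=10,
                               false_alarms=2, correct_rejections=18)
# a low false-alarm rate keeps the corrected score close to the raw hit rate
```

As the abstract notes, when the false-alarm rate is low, corrections of this kind move scores only slightly away from the raw hit rate.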

124 citations


Journal ArticleDOI
Shahrzad Saif1
TL;DR: In this paper, the authors explore the possibility of creating positive washback by focusing on factors in the background of the test development process and anticipating the conditions most likely to lead to positive washback.
Abstract: The aim of this study is to explore the possibility of creating positive washback by focusing on factors in the background of the test development process and anticipating the conditions most likely to lead to positive washback. The article reports on a multiphase empirical study investigating the washback effects of a needs-based test of spoken language proficiency on the content, teaching, classroom activities and learning outcomes of the ITA (international teaching assistants) training program linked to it. As such, the conceptual framework underlying the study differs from previous models in that it includes the processes before test development and test design as two main components of washback investigation. The analysis of the data - collected from different stakeholders through interviews, observations and test administration at different intervals before, during and after the training program - suggests a positive relationship between the test and the immediate teaching and learning outcomes. Th...

107 citations


Journal ArticleDOI
TL;DR: This article used generalizability theory (G-theory) procedures to examine the impact of the number of tasks and raters per speech sample and of subsection lengths on the dependability of speaking scores.
Abstract: A multitask speaking measure consisting of both integrated and independent tasks is expected to be an important component of a new version of the TOEFL test. This study considered two critical issues concerning score dependability of the new speaking measure: How much would the score dependability be impacted by (1) combining scores on different task types into a composite score and (2) rating each task only once? To answer these questions, generalizability theory (G-theory) procedures were used to examine the impact of the numbers of tasks and raters per speech sample and of subsection lengths on the dependability of speaking scores. Univariate and multivariate G-theory analyses were conducted on rating data collected for 261 examinees for the study. The finding in the univariate analyses was that it would be more efficient to increase the number of tasks rather than the number of ratings per speech sample in maximizing the score dependability. The multivariate G-theory analyses also revealed that (1) th...
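The trade-off the study examines (more tasks versus more ratings per speech sample) can be illustrated with the standard index of dependability (Phi) for a crossed persons × tasks × ratings design. The variance components below are hypothetical and chosen only to show the direction of the finding, not the study's estimates:

```python
def phi(var, n_tasks, n_ratings):
    """Projected index of dependability (Phi) for a crossed
    persons x tasks x ratings G-theory design: person variance
    divided by person variance plus absolute error variance."""
    abs_error = (var['t'] / n_tasks
                 + var['r'] / n_ratings
                 + var['pt'] / n_tasks
                 + var['pr'] / n_ratings
                 + var['tr'] / (n_tasks * n_ratings)
                 + var['ptr'] / (n_tasks * n_ratings))
    return var['p'] / (var['p'] + abs_error)

# hypothetical variance components (NOT the study's estimates)
v = {'p': 0.50, 't': 0.02, 'r': 0.01,
     'pt': 0.12, 'pr': 0.03, 'tr': 0.01, 'ptr': 0.20}

# with six ratings total, spending them on six single-rated tasks
# can beat double-rating three tasks, as the univariate analyses found
assert phi(v, n_tasks=6, n_ratings=1) > phi(v, n_tasks=3, n_ratings=2)
```

The comparison holds here because task-related error components shrink with the number of tasks, which typically dominate rater-related components for speaking tasks.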

95 citations


Journal ArticleDOI
TL;DR: This paper investigated a group oral test, one component of an in-house test administered at a university in Japan, to determine whether its scores are appropriate for higher-stakes decision-making.
Abstract: This article investigates a group oral test as administered at a university in Japan to find if it is appropriate to use scores for higher stakes decision making. It is one component of an in-house...

95 citations


Journal ArticleDOI
TL;DR: In this article, the authors employed both the IRT-LR (item response theory likelihood ratio) and a series of CFA (confirmatory factor analysis) multi-sample analyses to systematically examine the relationships between DIF (differential item functioning) and DTF (differential test functioning) with a random sample of 15 000 Korean examinees.
Abstract: The present study utilized both the IRT-LR (item response theory likelihood ratio) and a series of CFA (confirmatory factor analysis) multi-sample analyses to systematically examine the relationships between DIF (differential item functioning) and DTF (differential test functioning) with a random sample of 15 000 Korean examinees. Specifically, DIF was detected using the IRT-LR method and the cumulative effect of DIF on DTF was gauged by the multi-sample analysis technique offered by the LISREL 8.5 program. The results of the current study indicate that item-level DIF, once detected, may carry over to test-level bias regardless of the DIF directions, thereby providing evidence that is mixed with respect to previous findings reported in the literature. This suggests that the relation of DIF to DTF is much more complex than that reported in the literature and, accordingly, more empirical studies are needed to bridge the gap in the literature about DIF-DTF relationships.
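How item-level DIF can accumulate into test-level DTF can be sketched with a simple two-parameter logistic (2PL) model. This illustrates only the aggregation idea, not the IRT-LR detection procedure or the LISREL multi-sample analysis, and all item parameters are hypothetical:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct
    response given ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_score(theta, items):
    """Expected test score at ability theta: sum of item probabilities."""
    return sum(icc_2pl(theta, a, b) for a, b in items)

# hypothetical (a, b) parameters; items 2 and 3 show uniform DIF
# in the same direction (harder for the focal group)
reference = [(1.0, 0.0), (1.2, -0.5), (0.8, 0.5)]
focal     = [(1.0, 0.0), (1.2, -0.1), (0.8, 0.9)]

# DTF at one ability level: the gap in expected test scores
dtf = expected_score(0.0, reference) - expected_score(0.0, focal)
# same-direction DIF accumulates rather than cancels, so dtf > 0 here
```

When items favour different groups, the per-item gaps can partially cancel at the test level, which is one reason the DIF-DTF relationship is more complex than a simple sum of flagged items.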

65 citations


Journal ArticleDOI
TL;DR: Grammar is central to language description, and a posteriori construct validation of language tests consistently identifies grammar as a significant factor in differentiating between score levels, as mentioned in this paper.
Abstract: Grammar is central to language description and a posteriori construct validation of language tests consistently identifies grammar as a significant factor in differentiating between score levels an...

63 citations


Journal ArticleDOI
TL;DR: This paper reported the results of an investigation, based on a 170 000-word corpus of test performance, of the validity of the College English Test-Spoken English Test (CET-SET) group discussion by e...
Abstract: This article reports the results of an investigation, based on a 170 000-word corpus of test performance, of the validity of the College English Test-Spoken English Test (CET-SET) group discussion by e...

63 citations


Journal ArticleDOI
TL;DR: This article found that the translation task yielded significantly more evidence of comprehension than did the immediate recall task, which indicates that the requirement of memory in the recall task hinders test-takers' ability to demonstrate fully their comprehension ability.
Abstract: The immediate written recall task, a widely used measure of both first language (L1) and second language (L2) reading comprehension, has been advocated over traditional test methods such as multiple choice, cloze tests and open-ended questions because it is a direct and integrative assessment task. It has been, however, criticized as requiring memory. Whether and how the requirement of memory biases our understanding of readers’ comprehension remains unexplored. This study compares readers’ performance on the immediate recall and a translation task in order to explore the effect of memory on readers’ recall. Ninety-seven college students participated in this study. All participants were native speakers of Mandarin Chinese whose ages ranged from 20 to 22. The results showed that the translation task yielded significantly more evidence of comprehension than did the immediate recall task, which indicates that the requirement of memory in the recall task hinders test-takers’ ability to demonstrate fully their...

Journal ArticleDOI
Abstract: The present study investigated the effects of reducing the number of options per item on psychometric characteristics of a Japanese EFL university entrance examination. A four-option multiple-choic...

Journal ArticleDOI
TL;DR: This article investigated the effect of background knowledge in languages for specific academic purposes (LSAP) tests.
Abstract: This study investigates the effect of background knowledge in languages for specific academic purposes (LSAP) tests. Following the observation of previous studies that the effect of background know...

Journal ArticleDOI
TL;DR: The authors investigate the extent to which examination boards are able to provide evidence of comparability across test forms, noting that few such studies are publicly available.
Abstract: Examination boards are often criticized for their failure to provide evidence of comparability across forms, and few such studies are publicly available. This study aims to investigate the extent t...

Journal ArticleDOI
TL;DR: This paper studied how a foreign language test got discursively constructed in the talk of upper-secondary-school leavers, and identified four interpretative repertoires in the students' accounts with different constructions of themselves as test-takers, the test, and their performance in the test.
Abstract: As part of a larger project, we studied how a foreign language test got discursively constructed in the talk of upper-secondary-school leavers. A group of students were asked to keep an oral diary to record their ideas, feelings and experiences of preparing for and taking the test over the last spring term of school, as part of a high-stakes national examination. In addition, they took part in discussions either in pairs or groups of three after having learned about the final test results. After transcribing the data, drawing on a form of discourse analysis originally launched by a group of social psychologists, we identified (at least) four interpretative repertoires in the students’ accounts - with different constructions of themselves as test-takers, the test, and their performance in the test - including expectations and explanations for success or failure as well as credit or blame. The findings point to variation in the uses of these repertoires, not only from one context to another but also from mo...

Journal ArticleDOI
TL;DR: The authors used a new methodological approach to describe variation in test task characteristics and to explore how differences in these characteristics relate to examinee performance on the TOEFL Reading Comprehension Section.
Abstract: The present study focuses on the task characteristics of reading passages and key sentences in a test of second language reading. Using a new methodological approach to describe variation in test task characteristics and explore how differences in these characteristics might relate to examinee performance, it posed the two following research questions: First, how do the characteristics of the texts used in a high-stakes test of English for Academic Purposes reading vary? Second, what relationships exist between its text characteristics and examinee performance?An expanded test task characteristics instrument was constructed, following Freedle and Kostin (1993) and Bachman et al. (1996), and adding a large number of syntactic features (Celce-Murcia and Larsen-Freeman, 1999). Ratings and numerical counts were compiled for three forms of the Test of English as a Foreign Language (TOEFL) Reading Comprehension Section. Taking items as the object of measurement, the results were then used in a series of explora...

Journal ArticleDOI
Su Zhang1
TL;DR: This article applied generalizability theory to investigate the contributions of persons, items, sections, and language backgrounds to the score dependability of the Test of English for International Communication (TOEIC).
Abstract: This study applied generalizability theory to investigate the contributions of persons, items, sections, and language backgrounds to the score dependability of the Test of English for International...

Journal ArticleDOI
TL;DR: The authors examined the multiple true-false (MTF) test format in second language testing by comparing MCQ and MTF test formats in two language areas of general English: vocabulary and reading.
Abstract: This study examined the multiple true-false (MTF) test format in second language testing by comparing multiple-choice (MCQ) and multiple true-false (MTF) test formats in two language areas of general English: vocabulary and reading. Two counter-balanced experimental designs - one for each language area - were examined in terms of the number of MCQ and MTF differentially responded to, MTF item dependency, reliability, and concurrent validity. The data were analysed by classical test theory (CTT) and Rasch analysis. The results showed a two- and three-fold increase in vocabulary and reading items answered, respectively. Participants responded to significantly more MTF items than MCQ, and further analysis revealed no item dependency for both language domains. Reliability increases were found in the reading tests. Item conversions did not alter the basic functioning of the MTF items, and common person equating plots demonstrated a steady relationship between MCQ and MTF person ability estimates.

Journal ArticleDOI
TL;DR: In this article, the authors investigate the effects of chain interaction impairment which may cost the test-takers' comprehension of texts and suggest chain-preserving deletion (CPD) as a pedagogical procedure.
Abstract: It is said that one important aspect of education is the production of coherent discourse (Halliday and Hasan, 1985). This is the speaker’s or the writer’s ability to organize relevant meanings in relation to each other, and this in turn requires the establishment of ‘chain interaction’ - relations between components of a message - in a text. The more chain interactions we have in a text, the more coherent and, as a result, the more comprehensible it will be. Based on the above argument, the present study aims at investigating the effects of chain interaction impairment which may cost the test-takers’ comprehension of texts - itself being an object of measurement in cloze (Alderson, 1983; Francis, 1999) - and account for their low performance. It also aims at suggesting ‘chain-preserving deletion’ (CPD) as a pedagogical procedure.
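The fixed-ratio (every nth word) deletion that CPD is contrasted with can be sketched as follows. The helper and passage are illustrative only (the passage paraphrases the Animal Farm opening discussed in the reply below); a CPD procedure would additionally avoid blanking words that participate in cohesive chains:

```python
def make_cloze(text, n=5, start=5):
    """Build a fixed-ratio cloze test by blanking every nth word,
    beginning with word number `start`. Returns the gapped text and
    the list of deleted words (the answer key)."""
    words = text.split()
    answers = []
    for i in range(start - 1, len(words), n):
        answers.append(words[i])
        words[i] = '____'
    return ' '.join(words), answers

# illustrative passage
passage = ("Mr Jones of the Manor Farm had locked the hen houses "
           "for the night but was too drunk to remember the popholes")
cloze_text, answers = make_cloze(passage, n=5)
# → answers == ['Manor', 'hen', 'but', 'remember']
```

Note how the mechanical deletion removes 'Manor' and 'hen', both of which sit in referential or lexical chains; this is exactly the chain interaction impairment the article argues can depress cloze performance.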

Journal ArticleDOI
TL;DR: In this article, Dastjerdi and Talebinezhad defined the concept of identity chains as blanks that involve the same referent, whether referred to in a lexically filled out noun phrase or through pronominals such as "he" or "himself".
Abstract: Dastjerdi and Talebinezhad (2006, hereafter referred to as DT) studied ‘chain-preserving’ deletions (CPD) in cloze tests in contrast to an every 5th word deletion procedure (as applied to 150 words of George Orwell’s Animal Farm, 1945). Their definition of CPD is loosely based on the writings of Halliday and Hasan (1976) and Halliday (1985) about cohesion and coherence. Among the contributing elements that Halliday and Hasan (1985: 48) identified – and that are adopted by DT – are reference, substitution and ellipsis, conjunction, and lexical cohesion. DT exemplify what they have in mind by CPD under the broad category of reference as blanks that involve the same referent, whether referred to in a lexically filled out noun phrase, e.g. as ‘Mr Jones’, or through pronominals such as ‘he’ or ‘himself’. When distinct phrases have the same referent, DT call them ‘identity chains’. The rest of their examples they lump under ‘similarity chains’. These involve predicates thematically associated with the same referent (Mr Jones), e.g. ‘was drunk’, ‘lurched’, and ‘kicked off’, or an understood time sequence, e.g. ‘night’, ‘light’, and ‘day’. They argue that if such ‘chains’ are kept intact when inserting blanks in a text to create a cloze test, the text will remain more coherent and this will make it easier for readers to fill in the blanks. These notions are derived from Halliday and Hasan’s (1985: 83) descriptions of ‘cohesive ties’ as ‘chains’. The key underlying concepts are connected with the terms ‘cohesion’ and ‘coherence’, but Halliday and Hasan (1976; 1985) are not consistent in defining these terms. According to them what they call ‘texture’, which seems to be the essence of meaningful discourse, is ...

Journal ArticleDOI
TL;DR: Yan's critique of Dastjerdi and Talebinezhad as discussed by the authors is interesting for a number reasons, and the authors are grateful for her comments, which can lead to a better understanding of both concepts of "chains" and deletion procedures in cloze testing.
Abstract: Yan’s critique of Dastjerdi and Talebinezhad (2006; henceforth, DT) is interesting for a number of reasons, and the authors are grateful for her comments, which can lead to a better understanding of both the concept of ‘chains’ and deletion procedures in cloze testing. The following reply addresses her comments in the same order that she raises the points at issue. The first point of concern for Yan is the definition of ‘chains’ and ‘cohesive ties’, which DT have derived from Halliday and Hasan’s (1985) concepts of ‘cohesion’ and ‘coherence’. Yan is right that Halliday and Hasan (1976; 1985) are not consistent in defining their terms, but this has little to do with CPD as proposed by DT. In CPD, the attempt is to keep the text from becoming non-text, both structurally and semantically, in the process of making cloze tests, by keeping the texture of the text intact. This, of course, as Bachman (1985) says, might be difficult to achieve and might eliminate the cloze test’s ‘ease of construction’, which to us does not sound like a suitable justification for excluding a strategy that may lead to rationally made tests. The focus of our article is actually how to maintain text coherence and cohesion (i.e. meaning relations and the sewing elements of text, respectively) in a cloze test, not to make it easier for the test-taker. The basis for the realization of the idea here can be any other definition of ‘chains’, ‘chain interactions’, etc., more consistent and more objective than that of Halliday and Hasan. Yan’s point concerning the learning effects has been taken care of by the two-week lapse in administering the test. However, we agree that this could have been accounted for more quantitatively through the procedure she mentions. It can be a good point for further research in this area.