
Showing papers in "Language Testing in 2006"


Journal ArticleDOI
TL;DR: This paper provided renewed converging empirical evidence for the hypothesis that asking test-takers to respond to text passages with multiple-choice questions induces response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts.
Abstract: This article provides renewed converging empirical evidence for the hypothesis that asking test-takers to respond to text passages with multiple-choice questions induces response processes that are strikingly different from those that respondents would draw on when reading in non-testing contexts. Moreover, the article shows that the construct of reading comprehension is assessment specific and is fundamentally determined through item design and text selection. The data come from qualitative analyses of 10 cognitive interviews conducted with non-native adult English readers who were given three passages with several multiple-choice questions from the CanTEST, a large-scale language test used for admission and placement purposes in Canada, in a partially counter-balanced design. The analyses show that:
• There exist multiple different representations of the construct of ‘reading comprehension’ that are revealed through the characteristics of the items.
• Learners view responding to multiple-choice questions ...

212 citations


Journal ArticleDOI
TL;DR: In this paper, the authors examined the hypothesis that C-tests measure general language proficiency and showed that language proficiency was divisible into more specific constructs and that examinee proficiency level differentially influenced C-test performance.
Abstract: What C-tests actually measure has been an issue of debate for many years. In the present research, the authors examined the hypothesis that C-tests measure general language proficiency. A total of 843 participants from four independent samples took a German C-test along with the TestDaF (Test of German as a Foreign Language). Rasch measurement modelling and confirmatory factor analysis provided clear evidence that the C-test in question was a highly reliable, unidimensional instrument, which measured the same general dimension as the four TestDaF sections: reading, listening, writing and speaking. Moreover, the authors showed that language proficiency was divisible into more specific constructs and that examinee proficiency level differentially influenced C-test performance. The findings have implications for the multicomponentiality and fluidity of the C-test measurement construct.

176 citations


Journal ArticleDOI
TL;DR: Despite increasing interest in interlanguage pragmatics research, research on assessment of this crucial area of second language competence still lags behind assessment of other aspects of learners as discussed by the authors.
Abstract: Despite increasing interest in interlanguage pragmatics research, research on assessment of this crucial area of second language competence still lags behind assessment of other aspects of learners...

125 citations


Journal ArticleDOI
TL;DR: This paper found that the Yes/No test (Huibregtse et al., 2002) is a valid measure of the type of L2 vocabulary knowledge assessed by the VLT, with implications for classroom application.
Abstract: Performance on the Yes/No test (Huibregtse et al., 2002) was assessed as a predictor of scores on the Vocabulary Levels Test (VLT), a standard test of receptive second language (L2) vocabulary knowledge (Nation, 1990). The use of identical items on both tests allowed a direct comparison of test performance, with alternative methods for scoring the Yes/No test also examined (Huibregtse et al., 2002). Overall, performance on both tests by English L2 university students (n = 36) was similar. Mean test accuracy on the various Yes/No methods ranged from 76-82%, comparable to VLT performance at 83%. However, paired t-tests showed the scoring methods used to correct raw hit performance increased the difference between the Yes/No test and criterion VLT scores to some degree. All Yes/No scores were strong predictors of VLT performance, regardless of method used, r = .8. Raw hit rate was the best predictor of VLT performance, due in part to the <5% false alarm rate. The low false alarm rate may be due to the participants, drawn primarily from non-Latin alphabet first languages (L1s), and the nature of the instructions. The results indicate the Yes/No test is a valid measure of the type of L2 vocabulary knowledge assessed by the VLT, with implications for classroom application.
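The hit/false-alarm logic behind these scoring methods can be sketched as follows. This is a generic illustration using the classic (h − f)/(1 − f) correction-for-guessing adjustment, with hypothetical counts; it is not a reproduction of the specific Huibregtse et al. (2002) indices or of the study's data:

```python
def yes_no_scores(hits, misses, false_alarms, correct_rejections):
    """Score a Yes/No vocabulary test from its four response counts.

    Real words the examinee claims to know are 'hits'; pseudowords
    claimed known are 'false alarms'. The correction-for-guessing
    formula below is the classic (h - f) / (1 - f) adjustment, shown
    for illustration only; it is not necessarily one of the scoring
    methods examined in the study.
    """
    h = hits / (hits + misses)                              # hit rate
    f = false_alarms / (false_alarms + correct_rejections)  # false-alarm rate
    corrected = (h - f) / (1 - f) if f < 1 else 0.0
    return h, corrected

# hypothetical counts: 50 real words, 20 pseudowords
raw, corrected = yes_no_scores(hits=40, misses=10,
                               false_alarms=2, correct_rejections=18)
# a low false-alarm rate keeps the corrected score close to the raw hit rate
```

As the abstract notes, when the false-alarm rate is low, corrections of this kind move scores only slightly away from the raw hit rate.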

124 citations


Journal ArticleDOI
Shahrzad Saif1
TL;DR: In this paper, the authors explore the possibility of creating positive washback by focusing on factors in the background of the test development process and anticipating the conditions most likely to lead to positive washback.
Abstract: The aim of this study is to explore the possibility of creating positive washback by focusing on factors in the background of the test development process and anticipating the conditions most likely to lead to positive washback. The article reports on a multiphase empirical study investigating the washback effects of a needs-based test of spoken language proficiency on the content, teaching, classroom activities and learning outcomes of the ITA (international teaching assistants) training program linked to it. As such, the conceptual framework underlying the study differs from previous models in that it includes the processes before test development and test design as two main components of washback investigation. The analysis of the data - collected from different stakeholders through interviews, observations and test administration at different intervals before, during and after the training program - suggests a positive relationship between the test and the immediate teaching and learning outcomes. Th...

107 citations


Journal ArticleDOI
TL;DR: This article used generalizability theory (G-theory) procedures to examine the impact of the number of tasks and raters per speech sample and of subsection lengths on the dependability of speaking scores.
Abstract: A multitask speaking measure consisting of both integrated and independent tasks is expected to be an important component of a new version of the TOEFL test. This study considered two critical issues concerning score dependability of the new speaking measure: How much would the score dependability be impacted by (1) combining scores on different task types into a composite score and (2) rating each task only once? To answer these questions, generalizability theory (G-theory) procedures were used to examine the impact of the numbers of tasks and raters per speech sample and of subsection lengths on the dependability of speaking scores. Univariate and multivariate G-theory analyses were conducted on rating data collected for 261 examinees for the study. The finding in the univariate analyses was that it would be more efficient to increase the number of tasks rather than the number of ratings per speech sample in maximizing the score dependability. The multivariate G-theory analyses also revealed that (1) th...
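The trade-off the study examines (more tasks versus more ratings per speech sample) can be illustrated with the standard index of dependability (Phi) for a crossed persons × tasks × ratings design. The variance components below are hypothetical and chosen only to show the direction of the finding, not the study's estimates:

```python
def phi(var, n_tasks, n_ratings):
    """Projected index of dependability (Phi) for a crossed
    persons x tasks x ratings G-theory design: person variance
    divided by person variance plus absolute error variance."""
    abs_error = (var['t'] / n_tasks
                 + var['r'] / n_ratings
                 + var['pt'] / n_tasks
                 + var['pr'] / n_ratings
                 + var['tr'] / (n_tasks * n_ratings)
                 + var['ptr'] / (n_tasks * n_ratings))
    return var['p'] / (var['p'] + abs_error)

# hypothetical variance components (NOT the study's estimates)
v = {'p': 0.50, 't': 0.02, 'r': 0.01,
     'pt': 0.12, 'pr': 0.03, 'tr': 0.01, 'ptr': 0.20}

# with six ratings total, spending them on six single-rated tasks
# can beat double-rating three tasks, as the univariate analyses found
assert phi(v, n_tasks=6, n_ratings=1) > phi(v, n_tasks=3, n_ratings=2)
```

The comparison holds here because task-related error components shrink with the number of tasks, which typically dominate rater-related components for speaking tasks.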

95 citations


Journal ArticleDOI
TL;DR: This paper investigated a group oral test, one component of an in-house test administered at a university in Japan, to determine whether its scores are appropriate for higher-stakes decision-making.
Abstract: This article investigates a group oral test as administered at a university in Japan to find if it is appropriate to use scores for higher stakes decision making. It is one component of an in-house...

95 citations


Journal ArticleDOI
TL;DR: In this article, the authors employed both the IRT-LR (item response theory likelihood ratio) and a series of CFA (confirmatory factor analysis) multi-sample analyses to systematically examine the relationships between DIF (differential item functioning) and DTF (differential test functioning) with a random sample of 15 000 Korean examinees.
Abstract: The present study utilized both the IRT-LR (item response theory likelihood ratio) and a series of CFA (confirmatory factor analysis) multi-sample analyses to systematically examine the relationships between DIF (differential item functioning) and DTF (differential test functioning) with a random sample of 15 000 Korean examinees. Specifically, DIF was detected using the IRT-LR method and the cumulative effect of DIF on DTF was gauged by the multi-sample analysis technique offered by the LISREL 8.5 program. The results of the current study indicate that item-level DIF, once detected, may carry over to test-level bias regardless of the DIF directions, thereby providing evidence that is mixed with respect to previous findings reported in the literature. This suggests that the relation of DIF to DTF is much more complex than that reported in the literature and, accordingly, more empirical studies are needed to bridge the gap in the literature about DIF-DTF relationships.
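How item-level DIF can accumulate into test-level DTF can be sketched with a simple two-parameter logistic (2PL) model. This illustrates only the aggregation idea, not the IRT-LR detection procedure or the LISREL multi-sample analysis, and all item parameters are hypothetical:

```python
import math

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: probability of a correct
    response given ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_score(theta, items):
    """Expected test score at ability theta: sum of item probabilities."""
    return sum(icc_2pl(theta, a, b) for a, b in items)

# hypothetical (a, b) parameters; items 2 and 3 show uniform DIF
# in the same direction (harder for the focal group)
reference = [(1.0, 0.0), (1.2, -0.5), (0.8, 0.5)]
focal     = [(1.0, 0.0), (1.2, -0.1), (0.8, 0.9)]

# DTF at one ability level: the gap in expected test scores
dtf = expected_score(0.0, reference) - expected_score(0.0, focal)
# same-direction DIF accumulates rather than cancels, so dtf > 0 here
```

When items favour different groups, the per-item gaps can partially cancel at the test level, which is one reason the DIF-DTF relationship is more complex than a simple sum of flagged items.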

65 citations


Journal ArticleDOI
TL;DR: Grammar is central to language description, and a posteriori construct validation of language tests consistently identifies grammar as a significant factor in differentiating between score levels, as mentioned in this paper.
Abstract: Grammar is central to language description and a posteriori construct validation of language tests consistently identifies grammar as a significant factor in differentiating between score levels an...

63 citations


Journal ArticleDOI
TL;DR: This paper reported the results of an investigation, based on a 170 000-word corpus of test performance, of the validity of the College English Test-Spoken English Test (CET-SET) group discussion by e...
Abstract: This article reports the results of an investigation, based on a 170 000-word corpus of test performance, of the validity of the College English Test-Spoken English Test (CET-SET) group discussion by e...

63 citations


Journal ArticleDOI
TL;DR: This article found that the translation task yielded significantly more evidence of comprehension than did the immediate recall task, which indicates that the requirement of memory in the recall task hinders test-takers' ability to demonstrate fully their comprehension ability.
Abstract: The immediate written recall task, a widely used measure of both first language (L1) and second language (L2) reading comprehension, has been advocated over traditional test methods such as multiple choice, cloze tests and open-ended questions because it is a direct and integrative assessment task. It has been, however, criticized as requiring memory. Whether and how the requirement of memory biases our understanding of readers’ comprehension remains unexplored. This study compares readers’ performance on the immediate recall and a translation task in order to explore the effect of memory on readers’ recall. Ninety-seven college students participated in this study. All participants were native speakers of Mandarin Chinese whose ages ranged from 20 to 22. The results showed that the translation task yielded significantly more evidence of comprehension than did the immediate recall task, which indicates that the requirement of memory in the recall task hinders test-takers’ ability to demonstrate fully their...

Journal ArticleDOI
Abstract: The present study investigated the effects of reducing the number of options per item on psychometric characteristics of a Japanese EFL university entrance examination. A four-option multiple-choic...

Journal ArticleDOI
TL;DR: This article investigated the effect of background knowledge in languages for specific academic purposes (LSAP) tests.
Abstract: This study investigates the effect of background knowledge in languages for specific academic purposes (LSAP) tests. Following the observation of previous studies that the effect of background know...

Journal ArticleDOI
TL;DR: The authors investigate the extent to which examination boards are able to provide evidence of comparability across test forms, noting that few such studies are publicly available.
Abstract: Examination boards are often criticized for their failure to provide evidence of comparability across forms, and few such studies are publicly available. This study aims to investigate the extent t...

Journal ArticleDOI
TL;DR: This paper studied how a foreign language test got discursively constructed in the talk of upper-secondary-school leavers, and identified four interpretative repertoires in the students' accounts with different constructions of themselves as test-takers, the test, and their performance in the test.
Abstract: As part of a larger project, we studied how a foreign language test got discursively constructed in the talk of upper-secondary-school leavers. A group of students were asked to keep an oral diary to record their ideas, feelings and experiences of preparing for and taking the test over the last spring term of school, as part of a high-stakes national examination. In addition, they took part in discussions either in pairs or groups of three after having learned about the final test results. After transcribing the data, drawing on a form of discourse analysis originally launched by a group of social psychologists, we identified (at least) four interpretative repertoires in the students’ accounts - with different constructions of themselves as test-takers, the test, and their performance in the test - including expectations and explanations for success or failure as well as credit or blame. The findings point to variation in the uses of these repertoires, not only from one context to another but also from mo...

Journal ArticleDOI
TL;DR: The authors used a new methodological approach to describe variation in test task characteristics and to explore how differences in these characteristics relate to examinee performance on the TOEFL Reading Comprehension Section.
Abstract: The present study focuses on the task characteristics of reading passages and key sentences in a test of second language reading. Using a new methodological approach to describe variation in test task characteristics and explore how differences in these characteristics might relate to examinee performance, it posed the two following research questions: First, how do the characteristics of the texts used in a high-stakes test of English for Academic Purposes reading vary? Second, what relationships exist between its text characteristics and examinee performance?An expanded test task characteristics instrument was constructed, following Freedle and Kostin (1993) and Bachman et al. (1996), and adding a large number of syntactic features (Celce-Murcia and Larsen-Freeman, 1999). Ratings and numerical counts were compiled for three forms of the Test of English as a Foreign Language (TOEFL) Reading Comprehension Section. Taking items as the object of measurement, the results were then used in a series of explora...

Journal ArticleDOI
Su Zhang1
TL;DR: This article applied generalizability theory to investigate the contributions of persons, items, sections, and language backgrounds to the score dependability of the Test of English for International Communication (TOEIC).
Abstract: This study applied generalizability theory to investigate the contributions of persons, items, sections, and language backgrounds to the score dependability of the Test of English for International...

Journal ArticleDOI
TL;DR: The authors examined the multiple true-false (MTF) test format in second language testing by comparing MCQ and MTF test formats in two language areas of general English: vocabulary and reading.
Abstract: This study examined the multiple true-false (MTF) test format in second language testing by comparing multiple-choice (MCQ) and multiple true-false (MTF) test formats in two language areas of general English: vocabulary and reading. Two counter-balanced experimental designs - one for each language area - were examined in terms of the number of MCQ and MTF differentially responded to, MTF item dependency, reliability, and concurrent validity. The data were analysed by classical test theory (CTT) and Rasch analysis. The results showed a two- and three-fold increase in vocabulary and reading items answered, respectively. Participants responded to significantly more MTF items than MCQ, and further analysis revealed no item dependency for both language domains. Reliability increases were found in the reading tests. Item conversions did not alter the basic functioning of the MTF items, and common person equating plots demonstrated a steady relationship between MCQ and MTF person ability estimates.

Journal ArticleDOI
TL;DR: In this article, the authors investigate the effects of chain interaction impairment which may cost the test-takers' comprehension of texts and suggest chain-preserving deletion (CPD) as a pedagogical procedure.
Abstract: It is said that one important aspect of education is the production of coherent discourse (Halliday and Hasan, 1985). This is the speaker’s or the writer’s ability to organize relevant meanings in relation to each other, and this in turn requires the establishment of ‘chain interaction’ - relations between components of a message - in a text. The more chain interactions we have in a text, the more coherent and, as a result, the more comprehensible it will be. Based on the above argument, the present study aims at investigating the effects of chain interaction impairment which may cost the test-takers’ comprehension of texts - itself being an object of measurement in cloze (Alderson, 1983; Francis, 1999) - and account for their low performance. It also aims at suggesting ‘chain-preserving deletion’ (CPD) as a pedagogical procedure.
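The fixed-ratio (every nth word) deletion that CPD is contrasted with can be sketched as follows. The helper and passage are illustrative only (the passage paraphrases the Animal Farm opening discussed in the reply below); a CPD procedure would additionally avoid blanking words that participate in cohesive chains:

```python
def make_cloze(text, n=5, start=5):
    """Build a fixed-ratio cloze test by blanking every nth word,
    beginning with word number `start`. Returns the gapped text and
    the list of deleted words (the answer key)."""
    words = text.split()
    answers = []
    for i in range(start - 1, len(words), n):
        answers.append(words[i])
        words[i] = '____'
    return ' '.join(words), answers

# illustrative passage
passage = ("Mr Jones of the Manor Farm had locked the hen houses "
           "for the night but was too drunk to remember the popholes")
cloze_text, answers = make_cloze(passage, n=5)
# → answers == ['Manor', 'hen', 'but', 'remember']
```

Note how the mechanical deletion removes 'Manor' and 'hen', both of which sit in referential or lexical chains; this is exactly the chain interaction impairment the article argues can depress cloze performance.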

Journal ArticleDOI
TL;DR: In this article, Dastjerdi and Talebinezhad defined the concept of identity chains as blanks that involve the same referent, whether referred to in a lexically filled out noun phrase or through pronominals such as "he" or "himself".
Abstract: Dastjerdi and Talebinezhad (2006, hereafter referred to as DT) studied ‘chain-preserving’ deletions (CPD) in cloze tests in contrast to an every 5th word deletion procedure (as applied to 150 words of George Orwell’s Animal Farm, 1945). Their definition of CPD is loosely based on the writings of Halliday and Hasan (1976) and Halliday (1985) about cohesion and coherence. Among the contributing elements that Halliday and Hasan (1985: 48) identified – and that are adopted by DT – are reference, substitution and ellipsis, conjunction, and lexical cohesion. DT exemplify what they have in mind by CPD under the broad category of reference as blanks that involve the same referent, whether referred to in a lexically filled out noun phrase, e.g. as ‘Mr Jones’, or through pronominals such as ‘he’ or ‘himself’. When distinct phrases have the same referent, DT call them ‘identity chains’. The rest of their examples they lump under ‘similarity chains’. These involve predicates thematically associated with the same referent (Mr Jones), e.g. ‘was drunk’, ‘lurched’, and ‘kicked off’, or an understood time sequence, e.g. ‘night’, ‘light’, and ‘day’. They argue that if such ‘chains’ are kept intact when inserting blanks in a text to create a cloze test, the text will remain more coherent and this will make it easier for readers to fill in the blanks. These notions are derived from Halliday and Hasan’s (1985: 83) descriptions of ‘cohesive ties’ as ‘chains’. The key underlying concepts are connected with the terms ‘cohesion’ and ‘coherence’, but Halliday and Hasan (1976; 1985) are not consistent in defining these terms. According to them what they call ‘texture’, which seems to be the essence of meaningful discourse, is ...

Journal ArticleDOI
TL;DR: Yan's critique of Dastjerdi and Talebinezhad as discussed by the authors is interesting for a number reasons, and the authors are grateful for her comments, which can lead to a better understanding of both concepts of "chains" and deletion procedures in cloze testing.
Abstract: Yan’s critique of Dastjerdi and Talebinezhad (2006; henceforth, DT) is interesting for a number of reasons, and the authors are grateful for her comments, which can lead to a better understanding of both the concept of ‘chains’ and deletion procedures in cloze testing. The following reply addresses her comments in the same order that she raises the points at issue. The first point of concern for Yan is the definition of ‘chains’ and ‘cohesive ties’, which DT have derived from Halliday and Hasan’s (1985) concepts of ‘cohesion’ and ‘coherence’. Yan is right that Halliday and Hasan (1976; 1985) are not consistent in defining their terms, but this has little to do with CPD as proposed by DT. In CPD, the attempt is to keep the text from becoming non-text, both structurally and semantically, in the process of making cloze tests, by keeping the texture of the text intact. This, of course, as Bachman (1985) says, might be difficult to achieve and might eliminate the cloze test’s ‘ease of construction’, which to us does not sound like a suitable justification for excluding a strategy that may lead to rationally made tests. The focus of our article is actually how to maintain text coherence and cohesion (i.e. meaning relations and the sewing elements of text, respectively) in a cloze test, not to make it easier for the test-taker. The basis for the realization of the idea here can be any other definition of ‘chains’, ‘chain interactions’, etc., more consistent and more objective than that of Halliday and Hasan. Yan’s point concerning the learning effects has been taken care of by the two-week lapse in administering the test. However, we agree that this could have been accounted for more quantitatively through the procedure she mentions. It can be a good point for further research in this area.