
Showing papers in "Journal of Educational and Behavioral Statistics in 1998"


Journal ArticleDOI
TL;DR: This paper is written as a step-by-step tutorial that shows how to fit the two most common multilevel models: (a) school effects models, designed for data on individuals nested within naturally occurring hierarchies (e.g., students within classes); and (b) individual growth models, designed for exploring longitudinal data (on individuals) over time.
Abstract: SAS PROC MIXED is a flexible program suitable for fitting multilevel models, hierarchical linear models, and individual growth models. Its position as an integrated program within the SAS statistic...

2,903 citations
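The school-effects setup described in this abstract can be illustrated without SAS. The following is a hedged numerical sketch (hypothetical parameters, Python/NumPy rather than PROC MIXED) that simulates students nested in schools and recovers the Level-1 and Level-2 variance components and the intraclass correlation with one-way random-effects ANOVA estimators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a two-level "school effects" dataset: students nested in schools.
# Parameters are hypothetical, chosen only for illustration.
n_schools, n_students = 50, 30
school_effects = rng.normal(0.0, 2.0, n_schools)   # Level-2 variance = 4
scores = school_effects[:, None] + rng.normal(0.0, 4.0, (n_schools, n_students))
# Level-1 (within-school) variance = 16, so the true ICC is 4 / 20 = 0.2.

# One-way random-effects ANOVA estimators of the variance components.
school_means = scores.mean(axis=1)
grand_mean = scores.mean()
msb = n_students * np.sum((school_means - grand_mean) ** 2) / (n_schools - 1)
msw = np.sum((scores - school_means[:, None]) ** 2) / (n_schools * (n_students - 1))

sigma2_within = msw
sigma2_between = (msb - msw) / n_students
icc = sigma2_between / (sigma2_between + sigma2_within)
print(f"within={sigma2_within:.2f} between={sigma2_between:.2f} ICC={icc:.2f}")
```

A full mixed-model fit (PROC MIXED, or `statsmodels` MixedLM in Python) additionally gives standard errors and accommodates covariates; the moment estimators above only convey what the two variance components mean.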


Journal ArticleDOI
TL;DR: For the comparison of more than two independent samples, the Kruskal-Wallis H test is a preferred procedure in many situations as discussed by the authors, however, the exact null and alternative hypotheses, as well as the ass...
Abstract: For the comparison of more than two independent samples the Kruskal-Wallis H test is a preferred procedure in many situations. However, the exact null and alternative hypotheses, as well as the ass...

264 citations
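For readers who want the mechanics of the H test itself, here is a minimal sketch computed directly from the rank-sum definition (assuming distinct observations, so no tie correction; the data are made up):

```python
import numpy as np

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction; assumes distinct values)."""
    data = np.concatenate(groups)
    n = data.size
    ranks = np.empty(n)
    ranks[np.argsort(data)] = np.arange(1, n + 1)  # rank 1 = smallest value
    h, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + g.size]
        h += g.size * (r.mean() - (n + 1) / 2.0) ** 2
        start += g.size
    return 12.0 * h / (n * (n + 1))

a = np.array([1.2, 3.1, 2.5, 4.0])
b = np.array([2.2, 5.5, 4.4, 6.1])
c = np.array([7.3, 8.1, 6.6, 9.0])
print(round(kruskal_wallis_h(a, b, c), 3))  # -> 8.346
```

H is referred to a chi-square distribution with k − 1 degrees of freedom (here 2 df), which is exactly where the abstract's point bites: the null hypothesis being tested is about the distributions, not simply the means, unless further assumptions are made.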


Journal ArticleDOI
TL;DR: A new method of controlling the exposure rate of items conditional on ability level is presented in the continuous testing environment made possible by the computerized administration of a test.
Abstract: The interest in the application of large-scale adaptive testing for secure tests has served to focus attention on issues that arise when theoretical advances are made operational. One such issue is that of ensuring item and pool security in the continuous testing environment made possible by the computerized administration of a test, as opposed to the more periodic testing environment typically used for linear paper-and-pencil tests. This article presents a new method of controlling the exposure rate of items conditional on ability level in this continuous testing environment. The properties of such conditional control on the exposure rates of items, when used in conjunction with a particular adaptive testing algorithm, are explored through studies with simulated data.

146 citations
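The article's conditional method is not reproduced here, but the flavor of probabilistic exposure control can be sketched. In Sympson-Hetter-style control, a nominated item is actually administered only if it passes a probabilistic filter; conditioning means the control parameter is maintained separately per ability stratum. All parameter values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-stratum exposure-control parameters k and (hypothetical) probabilities
# that the adaptive algorithm nominates this item in each stratum. The item
# is rarely nominated at low theta, so it needs no throttling there; at high
# theta it would be overexposed without the filter.
k_by_stratum = {"low": 1.0, "high": 0.2}
select_prob = {"low": 0.05, "high": 0.9}

counts = {s: 0 for s in k_by_stratum}
n_per_stratum = 20_000
for s in k_by_stratum:
    for _ in range(n_per_stratum):
        nominated = rng.random() < select_prob[s]
        # Administered only if nominated AND it passes the exposure filter;
        # otherwise the algorithm would fall back to a substitute item.
        if nominated and rng.random() < k_by_stratum[s]:
            counts[s] += 1

for s, c in counts.items():
    print(s, round(c / n_per_stratum, 3))  # high-stratum rate capped near 0.18
```

In an operational system the k values are themselves calibrated iteratively from simulations so that conditional exposure rates stay below a target ceiling.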


Journal ArticleDOI
TL;DR: This article presented a method for handling educational data in which students belong to more than one unit at a given level, but there is missing information on the identification of the units to be assigned to students.
Abstract: This paper presents a method for handling educational data in which students belong to more than one unit at a given level, but there is missing information on the identification of the units to wh...

133 citations


Journal ArticleDOI
TL;DR: A review of fundamental concepts and applications used to address the Behrens-Fisher problem under fiducial, Bayesian, and frequentist approaches is presented in this paper.
Abstract: The Behrens-Fisher problem arises when one seeks to make inferences about the means of two normal populations without assuming the variances are equal. This paper presents a review of fundamental concepts and applications used to address the Behrens-Fisher problem under fiducial, Bayesian, and frequentist approaches. Methods of approximations to the Behrens-Fisher distribution and a simple Bayesian framework for hypothesis testing are also discussed. Finally, a discussion is provided for the use of generalized p values in significance testing of hypotheses in the presence of nuisance parameters. It is shown that the generalized p values based on a frequentist probability for the Behrens-Fisher problem are numerically the same as those from the fiducial and Bayesian solutions. A table for tests of significance is also included.

91 citations
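Welch's approximation is the most common frequentist entry point to the Behrens-Fisher problem: compare means without pooling the variances, and refer the statistic to a t distribution with Satterthwaite degrees of freedom. A self-contained sketch with illustrative data:

```python
import numpy as np
from math import sqrt

def welch_t(x, y):
    """Welch's approximate solution to the Behrens-Fisher problem:
    t statistic and Satterthwaite degrees of freedom (one of the
    frequentist approximations the paper reviews alongside fiducial
    and Bayesian treatments)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / x.size, y.var(ddof=1) / y.size
    t = (x.mean() - y.mean()) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    return t, df

x = [10.1, 9.4, 11.2, 10.8, 9.9]        # small variance
y = [8.1, 12.5, 6.0, 13.9, 7.7, 11.3]   # large variance
t, df = welch_t(x, y)
print(f"t={t:.3f} df={df:.1f}")
```

Note how the fractional degrees of freedom (between min(nx, ny) − 1 and nx + ny − 2) encode the variance inequality; the generalized p value discussed in the abstract gives numerically identical answers to the fiducial and Bayesian solutions.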


Journal ArticleDOI
TL;DR: In this article, it was shown that parallel three-parameter logistic IRFs do not result in uniform differential item functioning (DIF) and that the term "uniform DIF" should be reserved for the condition in which the association between the item response and group is constant for all values of the matching variable, as distinguished from parallel and unidirectional DIF.
Abstract: Uniform differential item functioning (DIF) exists when the statistical relationship between item response and group is constant for all levels of a matching variable. Two other types of DIF are defined based on differences in item response functions (IRFs) among the groups of examinees: unidirectional DIF (the IRFs do not cross) and parallel DIF (the IRFs are the same shape but shifted from one another by a constant, i.e., the IRFs differ only in location). It is shown that these three types of DIF are not equivalent, and the relationships among them are examined in this paper for two item response categories, two groups, and an ideal continuous univariate matching variable. The results imply that unidirectional and parallel DIF, which have been considered uniform DIF by several authors, are not uniform DIF. For example, it is shown in this paper that parallel three-parameter logistic IRFs do not result in uniform DIF. It is suggested that the term "uniform DIF" be reserved for the condition in which the association between the item response and group is constant for all values of the matching variable, as distinguished from parallel and unidirectional DIF. Differential item functioning (DIF) refers to differences in the measurement properties of a test item among different groups of examinees. Many statistical techniques have been proposed to assess DIF (Millsap & Everson, 1993). This paper focuses on a particular type of DIF: uniform DIF. Uniform DIF exists when the statistical relationship between item response and group is constant for all levels of a matching variable (Mellenbergh, 1982). Two other types of DIF are defined based on differences in the item response functions (IRFs) among

73 citations
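The paper's central claim, that parallel 3PL IRFs do not produce uniform DIF, can be checked numerically: with a nonzero lower asymptote the log-odds ratio between groups varies with theta, while with c = 0 and equal slopes it is constant. A sketch with hypothetical item parameters, using the usual 1.7 scaling constant:

```python
import numpy as np

def irf_3pl(theta, a, b, c):
    """Three-parameter logistic IRF with the usual 1.7 scaling constant."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def log_odds_ratio(p_ref, p_foc):
    return np.log(p_ref / (1 - p_ref)) - np.log(p_foc / (1 - p_foc))

theta = np.linspace(-3, 3, 7)

# Parallel 3PL IRFs: identical shape, locations shifted by 0.5.
lor_3pl = log_odds_ratio(irf_3pl(theta, 1.0, 0.0, 0.2),
                         irf_3pl(theta, 1.0, 0.5, 0.2))
print(np.round(lor_3pl, 3))  # varies with theta -> NOT uniform DIF

# Same comparison with c = 0 (2PL): the log-odds ratio equals
# 1.7 * a * (b_foc - b_ref) = 0.85 at every theta -> uniform DIF.
lor_2pl = log_odds_ratio(irf_3pl(theta, 1.0, 0.0, 0.0),
                         irf_3pl(theta, 1.0, 0.5, 0.0))
print(np.round(lor_2pl, 3))
```

At low theta both 3PL curves collapse to the common guessing floor c, so the association between response and group vanishes there and grows with theta, which is exactly why parallel is not uniform in the paper's terminology.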


Journal ArticleDOI
TL;DR: In this article, the local dependence of item pairs is investigated via a conditional covariance function estimation procedure using kernel smoothing, and a standardization to adjust for the confounding effect of item difficulty is introduced.
Abstract: The local dependence of item pairs is investigated via a conditional covariance function estimation procedure. The conditioning variable used in the procedure is obtained by a monotonic transformation of total score on the remaining items. Intuitively, the conditioning variable corresponds to the unidimensional latent ability that is best measured by the test. The conditional covariance functions are estimated using kernel smoothing, and a standardization to adjust for the confounding effect of item difficulty is introduced. The particular standardization chosen is an adaptation of Yule's coefficient of colligation. Several models of local dependence are discussed to explain special situations, such as speededness and latent space multidimensionality, in which the assumptions of unidimensionality and local independence are violated.

63 citations
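A rough sketch of the estimation idea follows, simplified in several ways relative to the article: Gaussian-kernel smoothing on a standardized rest score stands in for the monotonic transformation, no colligation-based standardization is applied, and the data are simulated with a hypothetical nuisance dimension loading on the first two items only:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate 10 binary items. Items 0 and 1 share a nuisance dimension, so
# they are locally dependent given theta; items 2..9 are locally independent.
n, n_items = 4000, 10
theta = rng.normal(size=n)
nuisance = rng.normal(size=n)
resp = np.zeros((n, n_items), dtype=int)
for j in range(n_items):
    load = 0.8 * nuisance if j < 2 else 0.0   # extra dimension on items 0, 1
    p = 1 / (1 + np.exp(-(theta + load - (j - 4.5) / 3)))
    resp[:, j] = rng.random(n) < p

def cond_cov(resp, i, j, bandwidth=0.75):
    """Kernel-smoothed covariance of items i, j conditional on rest score."""
    rest = resp.sum(axis=1) - resp[:, i] - resp[:, j]
    rest = (rest - rest.mean()) / rest.std()   # crude proxy for theta
    covs = []
    for t in np.linspace(-1.5, 1.5, 7):
        w = np.exp(-0.5 * ((rest - t) / bandwidth) ** 2)  # Gaussian kernel
        mi = np.average(resp[:, i], weights=w)
        mj = np.average(resp[:, j], weights=w)
        covs.append(np.average((resp[:, i] - mi) * (resp[:, j] - mj), weights=w))
    return float(np.mean(covs))               # average over the grid

print(round(cond_cov(resp, 0, 1), 3))  # elevated: local dependence
print(round(cond_cov(resp, 2, 3), 3))  # smaller: locally independent pair
```

Even the "independent" pair shows a small positive conditional covariance because the rest score is only a noisy proxy for theta; the article's standardization and choice of conditioning transformation address exactly such confounds.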


Journal ArticleDOI
TL;DR: The authors proposed a new regression correction using essentially a two-segment piecewise linear regression of the true on observed matching subtest scores, which adjusts for the Type I error-inflating and estimation-biasing influence of group target ability differences.
Abstract: One emphasis in the development and evaluation of SIBTEST has been the control of Type I error (false flagging of non-differential item functioning [DIF] items) inflation and estimation bias. SIBTEST has performed well in comparative simulation studies of Type I error and estimation bias relative to other procedures such as the Mantel-Haenszel and logistic regression procedures. Nevertheless, for a minority of cases that might occur in applications, it has displayed sizable Type I error inflation and estimation bias. A vital part of SIBTEST is the regression correction, which adjusts for the Type I error-inflating and estimation-biasing influence of group target ability differences by using the linear regression of true on observed matching subtest scores from classical test theory. In this paper, we propose a new regression correction, using essentially a two-segment piecewise linear regression of the true on observed matching subtest scores. A realistic simulation study of the new approach shows that when there is a ...

62 citations
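The two-segment regression itself is just ordinary least squares on a hinge basis. A sketch with an illustrative fixed knot and synthetic data (the paper's actual correction derives the regression of true on observed scores from classical test theory quantities, not from a raw fit like this):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic kinked relationship: slope 0.5 below the knot, 1.2 above it.
x = np.linspace(0, 20, 200)
true_score = np.where(x < 10, 0.5 * x, 5 + 1.2 * (x - 10))
y = true_score + rng.normal(0, 0.5, x.size)

knot = 10.0
# Basis: intercept, x, and the hinge term (x - knot)+.
# The fit is continuous at the knot by construction.
X = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # approx [0, 0.5, 0.7]: slopes 0.5, then 0.5 + 0.7 = 1.2
```

A single straight line fit to these data would be systematically biased in both segments, which is the analogue of the Type I error inflation the abstract attributes to the original one-segment correction in unfavorable cases.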


Journal ArticleDOI
TL;DR: Bayesian analysis of hierarchically structured data with random intercept and heterogeneous within-group (Level-1) variance is presented in this article, where inferences about all parameters, including the Level-1 variance and intercept for each group, are based on their marginal posterior distributions approximated via the Gibbs sampler.
Abstract: Bayesian analysis of hierarchically structured data with random intercept and heterogeneous within-group (Level-1) variance is presented. Inferences about all parameters, including the Level-1 variance and intercept for each group, are based on their marginal posterior distributions approximated via the Gibbs sampler. Analysis of artificial data with varying degrees of heterogeneity and varying Level-2 sample sizes illustrates the likely benefits of using a Bayesian approach to model heterogeneity of variance (Bayes/Het). Results are compared to those based on now-standard restricted maximum likelihood with homogeneous Level-1 variance (RML/Hom). Bayes/Het provides sensible interval estimates for Level-1 variances and their heterogeneity, and, relatedly, for each group’s intercept. RML/Hom inferences about Level-2 regression coefficients appear surprisingly robust to heterogeneity, and conditions under which such robustness can be expected are discussed. Application is illustrated in a reanalysis of High S...

51 citations
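A minimal Gibbs sampler for the heterogeneous Level-1 variance structure can be sketched as follows. For brevity the hyperparameters are fixed and only the conjugate normal/inverse-gamma updates are shown; the article's model places priors on the Level-2 quantities as well:

```python
import numpy as np

rng = np.random.default_rng(4)

# Model sketch: y_ij ~ N(mu_j, sigma2_j) with heterogeneous sigma2_j,
# and mu_j ~ N(mu0, tau2). Hyperparameters mu0, tau2 and the inverse-gamma
# prior (a0, b0) on sigma2_j are fixed here for simplicity.
J, n = 8, 40
true_mu = rng.normal(0, 1, J)
true_s2 = rng.uniform(0.5, 3.0, J)          # heterogeneous Level-1 variances
y = [rng.normal(true_mu[j], np.sqrt(true_s2[j]), n) for j in range(J)]

mu0, tau2, a0, b0 = 0.0, 1.0, 2.0, 2.0
mu, sigma2 = np.zeros(J), np.ones(J)
draws_s2 = []
for it in range(2000):
    for j in range(J):
        # mu_j | sigma2_j, y : conjugate normal update
        prec = n / sigma2[j] + 1 / tau2
        mean = (y[j].sum() / sigma2[j] + mu0 / tau2) / prec
        mu[j] = rng.normal(mean, np.sqrt(1 / prec))
        # sigma2_j | mu_j, y : conjugate inverse-gamma update
        a = a0 + n / 2
        b = b0 + 0.5 * np.sum((y[j] - mu[j]) ** 2)
        sigma2[j] = 1 / rng.gamma(a, 1 / b)   # inverse-gamma via 1/Gamma(a, rate b)
    if it >= 500:                             # discard burn-in
        draws_s2.append(sigma2.copy())

post_mean_s2 = np.mean(draws_s2, axis=0)
print(np.round(post_mean_s2, 2))  # tracks the heterogeneous true variances
```

The marginal posterior draws give exactly the kind of interval estimates for each sigma2_j (and, via mu_j, each group's intercept) that the abstract credits to Bayes/Het over RML/Hom.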


Journal ArticleDOI
TL;DR: In this paper, an extension of the graded response model for the measurement of variability and change is presented, where an occasion-specific latent variable is decomposed into (a) a person-specific variable (a trait variable) and (b) an event-specific deviation variable measuring the variability caused by situational and/or interactional effects.
Abstract: An extension of the graded response model of Samejima (1969) for the measurement of variability and change is presented. In this model it is assumed that an occasion-specific latent variable is decomposed into (a) a person-specific variable (a trait variable) and (b) an occasion-specific deviation variable measuring the variability caused by situational and/or interactional effects. Furthermore, it is assumed that interindividual differences in intraindividual trait change occur between a priori specified periods of time. The correlations of the latent trait variables between periods of time indicate the degree of (trait) change. It is shown how the parameters of the model can be estimated and some implications of the model can be tested with structural equation models for ordered variables. Finally, the model is illustrated by an application to the measurement of students’ interest in the topic of radioactivity. Based on the results of a longitudinal study of students over 4 years, it is shown that a mod...

50 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compared the power of the IGA and WJ tests for the within-subjects (trials) main effect and the groups x trials interaction when dispersion matrix equality is violated.
Abstract: Power for the improved general approximation (IGA) and Welch-James (WJ) tests of the within-subjects (trials) main effect and the within-subjects x between-subjects (groups x trials) interaction was estimated for a design with one between- and one within-subjects factor. The distribution of the data had two levels: multivariate normal and multivariate lognormal. Power estimates for conditions in which there were between-groups differences in dispersion matrices showed that, for both effects, there were conditions in which the IGA test was more powerful and conditions in which the WJ test was more powerful. The power advantage for the IGA test tended to be fairly small, whereas the power advantage for the WJ test was quite large in many conditions. Furthermore, the number of conditions favoring the WJ test was much larger than the number of conditions favoring the IGA test. Power for the IGA, WJ, ε-adjusted, and MANOVA tests was compared for conditions in which dispersion matrices were equal across groups. Results indicate that little if any power was sacrificed by using WJ or IGA tests in place of MANOVA or ε-adjusted tests. The split-plot design with one between-subjects (groups) factor and one within-subjects (trials) factor is commonly used in educational and psychological research. The purpose of the research reported in this article was to compare the power of the multivariate Welch-James (WJ; H. J. Keselman, Carriere, & Lix, 1993) and improved general approximation (IGA; Algina, 1994; Huynh, 1978) tests. These tests were designed to test the trials and the groups x trials effects when dispersion matrix equality is violated. Both the trials main effect and the groups x trials interaction can be tested by using univariate analysis of variance (ANOVA) or multivariate analysis of variance (MANOVA). The validity of these tests depends on the degree to which the data conform to the assumptions used in deriving the tests.
Let J denote the number of levels in the groups factor, K the number of levels in the trials factor, and n_j the number of subjects in level j of the groups factor, and let N = n_1 + ... + n_J. The data collected in the design can be viewed as realizations of J random

Journal ArticleDOI
TL;DR: The authors provided a set of conditions for the validity of inference for Item Response Theory (IRT) models applied to data collected from choice-based examinations, which are typical of those required in much more general settings.
Abstract: Examinations that permit students to choose a subset of the items are popular despite the potential that students may take examinations of varying difficulty as a result of their choices. We provide a set of conditions for the validity of inference for Item Response Theory (IRT) models applied to data collected from choice-based examinations. Valid likelihood and Bayesian inference using standard estimation methods require (except in extraordinary circumstances) that there is no dependence, after conditioning on the observed item responses, between the examinees' choices and their (potential but unobserved) responses to omitted items, as well as their latent abilities. These independence assumptions are typical of those required in much more general settings. Common low-dimensional IRT models estimated by standard methods, though potentially useful tools for educational data, do not resolve the difficult problems posed by choice-based data.

Journal ArticleDOI
TL;DR: In this article, a procedure is presented for locating on the latent trait scale the scores (or responses) of items that follow the three-parameter logistic and monotone partial credit (MPC) models.
Abstract: A procedure is presented for locating on the latent trait scale the scores (or responses) of items that follow the three-parameter logistic (3PL) and monotone partial credit (MPC) models. The procedure is based on a Bayesian updating of the item information and is identical to locating the score at the latent trait value that maximizes the Bock score information. Applications are provided in terms of selecting items or score categories for criterion-referenced interpretation and mapping and analyzing score categories.

Journal ArticleDOI
TL;DR: In this article, a technique for applying the Rule Space model of cognitive diagnosis to assessment in a semantically-rich domain is presented, which consists of processes for constructing an initial representation of an item (labeled understand), forming goals and performing actions based on those goals, and determining whether goals have been attempted and satisfied (check).
Abstract: This paper presents a technique for applying the Rule Space model of cognitive diagnosis to assessment in a semantically-rich domain. Responses to 22 architecture test items, developed to assess a range of architectural knowledge, were analyzed using Rule Space. Verbal protocol analyses guided the construction of a model of examinee performance, consisting of processes for constructing an initial representation of an item (labeled understand), forming goals and performing actions based on those goals (solve), and determining whether goals have been attempted and satisfied (check). Item attributes, derived from these processes, formed the basis for diagnosis. Our technique extends Rule Space's applicability by defining attributes in terms of item characteristics and the causal relations between characteristics and the problem-solving model. Data were collected from 122 architects of various ability levels (students, architecture interns, and professional architects). Rule Space successfully classified approximately 65%, 90%, and 40% of examinees based, respectively, on attributes associated with the understand, solve, and check processes of the problem-solving model. The findings support the effectiveness of Rule Space in a complex domain and suggest directions for developing new architecture items by using attributes particularly effective at distinguishing among examinees of different ability levels.

Journal ArticleDOI
TL;DR: In this article, a set of post hoc contrasts based on subsets of the treatment groups and simulating critical values from the appropriate multivariate F-distribution to be used in place of those associated with Scheffe's test is proposed.
Abstract: Scheffe’s test (Scheffe, 1953), which is commonly used to conduct post hoc contrasts among k group means, is unnecessarily conservative because it guards against an infinite number of potential post hoc contrasts when only a small set would ever be of interest to a researcher. This paper identifies a set of post hoc contrasts based on subsets of the treatment groups and simulates critical values from the appropriate multivariate F-distribution to be used in place of those associated with Scheffe’s test. The proposed method and its critical values provide a uniformly more powerful post hoc procedure.
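The simulation idea can be sketched directly: draw the null distribution of the maximum contrast F over a restricted contrast set and compare its 95th percentile with Scheffé's bound. Here the restricted set is simply all pairwise contrasts among k = 4 groups, which is only illustrative; the paper builds its sets from subsets of the treatment groups:

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo null distribution of max F over the 6 pairwise contrasts
# among k = 4 group means, n = 20 per group, unit error variance.
k, n, reps = 4, 20, 20_000
pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
max_f = np.empty(reps)
for r in range(reps):
    means = rng.normal(0, 1 / np.sqrt(n), k)            # group means under H0
    mse = rng.chisquare(k * (n - 1)) / (k * (n - 1))    # pooled variance estimate
    f_pair = [(means[i] - means[j]) ** 2 / (2 * mse / n) for i, j in pairs]
    max_f[r] = max(f_pair) / (k - 1)                    # scaled like Scheffe's F
crit_sim = np.quantile(max_f, 0.95)
print(round(crit_sim, 2))  # well below Scheffe's F(.95; 3, 76), roughly 2.72
```

Because the simulated critical value is smaller than Scheffé's (which must also cover contrasts nobody would test), every contrast in the restricted set is tested with more power at the same familywise error rate, which is the paper's "uniformly more powerful" claim in miniature.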

Journal ArticleDOI
TL;DR: In this article, the authors considered the problem of generating a test from an item bank using a criterion based on classical test theory parameters and formulated a mathematical programming model that maximizes the reliability coefficient α, subject to logical constraints on the choice of items.
Abstract: This article considers the problem of generating a test from an item bank using a criterion based on classical test theory parameters. A mathematical programming model is formulated that maximizes the reliability coefficient α, subject to logical constraints on the choice of items. The special structure of the problem is exploited with network theory and Lagrangian relaxation techniques. An empirical study shows that the method produces tests with high coefficient α subject to various practicable item constraints.
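The objective can be illustrated with a much simpler heuristic than the paper's network/Lagrangian machinery: greedily add the bank item that most increases coefficient α. The item bank and loadings below are simulated and purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)

def cronbach_alpha(scores):
    """Coefficient alpha for an examinee x item score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical bank: 30 items with varying discrimination (loadings).
n, bank_size, test_len = 500, 30, 10
theta = rng.normal(size=n)
load = rng.uniform(0.1, 1.5, bank_size)
scores = theta[:, None] * load + rng.normal(size=(n, bank_size))

# Greedy assembly: at each step pick the item that maximizes alpha of the
# current selection. (Only a stand-in for the paper's exact optimization.)
selected = [0, 1]                 # alpha needs at least two items to start
while len(selected) < test_len:
    candidates = [j for j in range(bank_size) if j not in selected]
    best = max(candidates,
               key=lambda j: cronbach_alpha(scores[:, selected + [j]]))
    selected.append(best)

alpha = cronbach_alpha(scores[:, selected])
print(f"alpha of assembled test: {alpha:.2f}")
```

The greedy heuristic ignores the "logical constraints" (content balance, enemy items, and so on) that make the real problem a structured mathematical program worth attacking with network theory and Lagrangian relaxation.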

Journal ArticleDOI
TL;DR: In this article, the problem of how to place students in a sequence of hierarchically related courses is addressed from a decision theory point of view, and it is shown that optimal mastery rules for the courses are always monotone and a nonincreasing function of the scores on the placement test.
Abstract: The problem of how to place students in a sequence of hierarchically related courses is addressed from a decision theory point of view. Based on a minimal set of assumptions, it is shown that optimal mastery rules for the courses are always monotone and a nonincreasing function of the scores on the placement test. On the other hand, placement rules are not generally monotone but have a form depending on the specific shape of the probability distributions and utility functions in force. The results are further explored for a class of linear utility functions.
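The monotonicity result is easy to visualize for a linear utility: if the probability of course success is increasing in the placement-test score, the expected utility of advancing crosses zero exactly once, so the optimal mastery rule is a single cutoff. All numbers below are hypothetical:

```python
import numpy as np

# Assume success probability given score x follows an increasing logistic
# curve (parameters hypothetical), and a linear utility: +1 for advancing a
# student who succeeds, -2 for advancing one who fails, 0 for retaining.
scores = np.arange(0, 41)
p_success = 1 / (1 + np.exp(-0.3 * (scores - 25)))
eu_advance = 1.0 * p_success - 2.0 * (1 - p_success)   # linear expected utility

decision = eu_advance > 0        # advance iff expected utility is positive
cutoff = scores[decision].min()
print("cutoff score:", cutoff)   # monotone rule: advance every x >= cutoff
```

Here EU > 0 exactly when p(x) > 2/3, and since p is increasing the decision set is an upper interval; with these numbers the cutoff lands at 28. The abstract's subtler point is that the *placement* rules across a course sequence need not be monotone in this way.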

Journal ArticleDOI
TL;DR: In this article, the use of ex-post (historical) simulation statistics as a means of evaluating latent variable growth models was considered, and the results illustrate the importance of using these measures as adjuncts to more traditional forms of model evaluation.
Abstract: This article considers the use of ex post (historical) simulation statistics as a means of evaluating latent variable growth models. Ex post simulation involves using the estimated parameters of a latent variable growth model to track the known historical values of an outcome of interest. Such methods of evaluating temporal models were developed primarily in applied economic forecasting and have been known for some time. This paper applies a variety of simulation quality statistics to latent variable growth models. In particular, Theil's (1966) inequality coefficient, bias proportion, variance proportion, and covariance proportion are used to gauge the simulation adequacy of growth models. An application to the study of change in science achievement using data from the Longitudinal Study of American Youth is provided to illustrate the methodology. The results illustrate the importance of using these measures as adjuncts to more traditional forms of model evaluation, especially if one is considering the use of these models for subsequent forecasting or other policy purposes.
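Theil's statistics are straightforward to compute. A sketch with made-up trajectory data; the bias, variance, and covariance proportions sum to 1 by construction, partitioning the mean squared simulation error into systematic, dispersion, and unsystematic parts:

```python
import numpy as np

def theil_decomposition(actual, simulated):
    """Theil's inequality coefficient U and its bias / variance / covariance
    proportions, as used to judge ex post simulation adequacy."""
    a, s = np.asarray(actual, float), np.asarray(simulated, float)
    mse = np.mean((s - a) ** 2)
    u = np.sqrt(mse) / (np.sqrt(np.mean(s ** 2)) + np.sqrt(np.mean(a ** 2)))
    bias = (s.mean() - a.mean()) ** 2 / mse            # systematic error
    var = (s.std() - a.std()) ** 2 / mse               # dispersion error
    r = np.corrcoef(s, a)[0, 1]
    cov = 2 * (1 - r) * s.std() * a.std() / mse        # unsystematic error
    return u, bias, var, cov

# Hypothetical achievement trajectory vs. model-tracked values.
actual    = [50.0, 52.1, 54.5, 56.2, 58.8, 60.1]
simulated = [50.5, 52.0, 54.0, 56.5, 58.0, 60.5]
u, bias, var, cov = theil_decomposition(actual, simulated)
print(f"U={u:.3f}  proportions: bias={bias:.2f} var={var:.2f} cov={cov:.2f}")
```

U near 0 indicates close tracking (U = 0 is a perfect ex post simulation); a large bias or variance proportion flags systematic model failure, whereas error concentrated in the covariance proportion is the benign, unsystematic kind.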


Journal ArticleDOI
TL;DR: The first task of our mega-review was to settle on a set of books to review as mentioned in this paper, which included 14 books devoted entirely to meta-analysis or its application, aimed at various audiences, and with diverse purposes.
Abstract: The first task of our mega-review was to settle on a set of books to review. We have included 14 books devoted entirely to meta-analysis or its application, aimed at various audiences, and with diverse purposes. Meta-analysis (a term coined by Glass, 1976) refers to the quantitative synthesis of empirical study results; some scholars have also referred to this enterprise as "research synthesis." We have excluded books not devoted primarily to meta-analysis (i.e., books with only a chapter or two on the topic). Our goal was not to critique in detail the contents of each book, but rather to provide an overview and comparison of the books, with the goal of creating a review that could be used by a range of individuals. A novice to the field will find advice on deciding where to start reading, depending on his or her reasons for exploring meta-analysis. Readers more familiar with the literature will find guidance in selecting from among these books to advance their understanding of particular topics in research synthesis. It is safe to say that not one of these books covers all topics in the area of meta-analysis. The vast (and growing) literature on meta-analysis, found in journals from statistics, medicine, and the social sciences, attests to that fact. However, these books have much to offer, especially to the novice in the field. We have grouped the books into three categories: instructional, methodological, and applications-oriented books. The instructional category comprises six books that cover the entire process of meta-analysis, generally in a nontechnical format. It also includes chapters in The Handbook of Research Synthesis (Cooper & Hedges, 1994) not devoted to statistical methods. Three books in the methodological category primarily describe statistical methodology, paying little