
Showing papers in "Applied Psychological Measurement in 1992"


Journal ArticleDOI
TL;DR: The generalized partial credit model (GPCM), as discussed by the authors, is a partial credit model with a varying slope parameter whose item step parameter is decomposed into location and threshold parameters, following Andrich's (1978) rating scale formulation.
Abstract: The partial credit model (PCM) with a varying slope parameter is developed and called the generalized partial credit model (GPCM). The item step parameter of this model is decomposed into a location and a threshold parameter, following Andrich's (1978) rating scale formulation. The EM algorithm for estimating the model parameters is derived. The performance of this generalized model is compared on both simulated and real data to a Rasch family of polytomous item response models. Simulated data were generated and then analyzed by the various polytomous item response models. The results demonstrate that the rating formulation of the GPCM is quite adaptable to the analysis of polytomous item responses. The real data used in this study consisted of the National Assessment of Educational Progress (Johnson & Allen, 1992) mathematics data that used both dichotomous and polytomous items. The PCM was applied to these data using both constant and varying slope parameters. The GPCM, which provides for varying slope pa...

1,219 citations
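For readers who want the functional form behind the abstract, here is a minimal sketch of the GPCM category response probabilities, with the item step parameter written as a location minus a threshold. The D = 1.7 scaling constant is omitted and the parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def gpcm_probs(theta, a, b, d):
    """Category response probabilities for one item under the GPCM.

    theta : examinee ability
    a     : item slope (a common slope for all items reduces this to the PCM)
    b     : item location
    d     : thresholds d_1..d_m, so the item step parameters are b - d_k
    Categories are scored 0..m.
    """
    steps = a * (theta - (b - np.asarray(d)))        # one term per step
    cum = np.concatenate(([0.0], np.cumsum(steps)))  # category 0 contributes 0
    num = np.exp(cum - cum.max())                    # stabilized exponentials
    return num / num.sum()

# Example: a four-category item
print(gpcm_probs(theta=0.5, a=1.2, b=0.0, d=[0.8, 0.0, -0.8]))
```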


Journal ArticleDOI
TL;DR: A new method is presented for incorporating a large number of constraints on adaptive item selection. The methodology emulates the test construction practices of expert test specialists, which is a necessity if computerized adaptive testing is to compete with conventional tests.
Abstract: Previous attempts at incorporating expert test construction practices into computerized adaptive testing paradigms are described. A new method is presented for incorporating a large number of constraints on adaptive item selection. The methodology emulates the test construction practices of expert test specialists, which is a necessity if computerized adaptive testing is to compete with conventional tests. Two examples—one for a verbal measure and the other for a quantitative measure—are provided of the successful use of the proposed method in designing adaptive tests.

220 citations
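The abstract does not spell out the selection rule, so the sketch below only illustrates the general idea of constraint-aware adaptive item selection: score each candidate item by its information plus a bonus for content areas that are still below their target counts. The scoring rule, weights, and data layout are assumptions for illustration, not the authors' method.

```python
import numpy as np

def pick_next_item(info, areas, counts, targets, weights, administered):
    """Choose the next item: maximize information plus a bonus for content
    areas still under their target count (illustrative rule only)."""
    best, best_score = None, -np.inf
    for j, item_info in enumerate(info):
        if j in administered:
            continue
        area = areas[j]
        shortfall = targets[area] - counts.get(area, 0)
        score = item_info + weights.get(area, 1.0) * max(shortfall, 0)
        if score > best_score:
            best, best_score = j, score
    return best

# Tiny pool: 4 items, two content areas
info = np.array([0.9, 1.1, 0.6, 0.8])
areas = ["algebra", "algebra", "geometry", "geometry"]
next_item = pick_next_item(info, areas, counts={"algebra": 1},
                           targets={"algebra": 2, "geometry": 2},
                           weights={"algebra": 0.5, "geometry": 0.5},
                           administered={0})
print(next_item)
```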


Journal ArticleDOI
TL;DR: In this paper, Monte Carlo methods were used to evaluate MML estimation of item parameters and maximum likelihood (ML) estimates of θ in the two-parameter logistic model for varying test lengths, sample sizes, and assumed θ distribution.
Abstract: Marginal maximum likelihood (MML) estimation of the logistic response model assumes a structure for the distribution of ability (θ). If this assumption is incorrect, the statistical properties of MML estimates may not hold. Monte Carlo methods were used to evaluate MML estimation of item parameters and maximum likelihood (ML) estimates of θ in the two-parameter logistic model for varying test lengths, sample sizes, and assumed θ distribution. 100 datasets were generated for each of the combinations of factors, allowing for item-level analyses based on means across replications. MML estimates of item difficulty were generally precise and stable in small samples, short tests, and under varying distributional assumptions of θ. When the true distribution of θ was normal, MML estimates of item discrimination were also generally precise and stable. ML estimates of θ were generally precise and stable, although the distribution of θ estimates was platykurtic and truncated at the high and low ends of the scor...

101 citations
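A small sketch of the data-generation side of such a Monte Carlo study under the two-parameter logistic model. The scaling constant D is omitted, and the "skewed" ability distribution is an arbitrary stand-in for whatever non-normal shapes were actually studied; item and ability estimation itself would be done with standard MML/ML software.

```python
import numpy as np

rng = np.random.default_rng(1992)

def simulate_2pl(n_persons, a, b, theta_dist="normal"):
    """Generate dichotomous responses under the 2PL model,
    P(correct) = 1 / (1 + exp(-a_j (theta_i - b_j))),
    with theta drawn from the requested true ability distribution."""
    if theta_dist == "normal":
        theta = rng.standard_normal(n_persons)
    elif theta_dist == "skewed":
        theta = rng.gamma(shape=2.0, scale=1.0, size=n_persons) - 2.0  # illustrative skew
    else:
        raise ValueError(theta_dist)
    logits = a[None, :] * (theta[:, None] - b[None, :])
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(int), theta

# 1,000 examinees, 20 items
a = rng.uniform(0.5, 2.0, 20)
b = rng.uniform(-2.0, 2.0, 20)
responses, theta = simulate_2pl(1000, a, b, theta_dist="normal")
print(responses.shape, responses.mean())
```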


Journal ArticleDOI
TL;DR: For a set of k items having nonintersecting item response functions (IRFs), the H coefficient (Loevinger, 1948; Mokken, 1971) applied to a transposed persons-by-items binary matrix, HT, has a non-negative value, as discussed by the authors.
Abstract: For a set of k items having nonintersecting item response functions (IRFs), the H coefficient (Loevinger, 1948; Mokken, 1971) applied to a transposed persons-by-items binary matrix, HT, has a non-negative value. Based on this result, a method is proposed for using HT to investigate whether a set of IRFs intersect. Results from a Monte Carlo study support the proposed use of HT. These results support the use of HT as an extension to Mokken's nonparametric item response theory approach.

90 citations
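Loevinger's H can be written as the ratio of summed item covariances to the maximum covariances the item marginals allow, and HT is the same computation applied to the transposed matrix. A minimal sketch under that covariance formulation (notation is assumed, not taken from the paper):

```python
import numpy as np

def loevinger_h(X):
    """Scalability coefficient H for a persons-by-items 0/1 matrix X, using
    H = sum_{i<j} cov(X_i, X_j) / sum_{i<j} cov_max(X_i, X_j),
    where cov_max = min(p_i, p_j) - p_i * p_j given the item proportions."""
    X = np.asarray(X, dtype=float)
    p = X.mean(axis=0)                        # item proportions correct
    cov = np.cov(X, rowvar=False, bias=True)  # observed covariances
    num = den = 0.0
    k = X.shape[1]
    for i in range(k):
        for j in range(i + 1, k):
            num += cov[i, j]
            den += min(p[i], p[j]) - p[i] * p[j]
    return num / den

def h_transposed(X):
    """H applied to the transposed matrix (persons play the role of items);
    a non-negative value is consistent with nonintersecting IRFs."""
    return loevinger_h(np.asarray(X).T)
```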


Journal ArticleDOI
TL;DR: In this article, the Stocking and Lord (1983) procedure for computing equating coefficients for tests having dichotomously scored items is extended to the case of graded response items.
Abstract: The Stocking and Lord (1983) procedure for computing equating coefficients for tests having dichotomously scored items is extended to the case of graded response items. A system of equations for obtaining the equating coefficients under Samejima's (1969, 1972) graded response model is derived. These equations are used to compute equating coefficients in two related situations. Under the first, the equating coefficients are obtained by matching, on an examinee-by-examinee basis, the true scores on two tests. In the second case, the equating coefficients are obtained by matching the test characteristic curves (TCCs) of the two tests. Several examples of computing equating coefficients in these two situations are provided. The TCC matching approach was much less demanding computationally and yielded equating coefficients that differed little from those obtained through the true score distribution matching approach.

89 citations
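The TCC-matching idea can be illustrated numerically: compute each form's test characteristic curve from graded response items and choose equating coefficients A and B that make the rescaled curves agree. The paper derives estimating equations analytically; the generic optimizer and the parameterization below are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

def grm_expected_score(theta, a, b):
    """Expected item score under Samejima's graded response model.
    a: slope; b: ordered boundary parameters b_1 < ... < b_{m-1}.
    E[score] is the sum of the boundary curves P(score >= k)."""
    theta = np.asarray(theta)[:, None]
    b = np.asarray(b)[None, :]
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return p_star.sum(axis=1)

def tcc(theta, items):
    return sum(grm_expected_score(theta, a, b) for a, b in items)

def tcc_matching_coeffs(items_old, items_new, theta=np.linspace(-4, 4, 41)):
    """Find (A, B) so that the new form's TCC, after rescaling a -> a/A and
    b -> A*b + B, matches the old form's TCC over a grid of theta values."""
    target = tcc(theta, items_old)
    def loss(x):
        A, B = x
        rescaled = [(a / A, A * np.asarray(b) + B) for a, b in items_new]
        return np.mean((tcc(theta, rescaled) - target) ** 2)
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x
```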


Journal ArticleDOI
TL;DR: Differential item functioning (DIF) has been informally conceptualized as multidimensionality as mentioned in this paper, which assumes that DIF is not a difference in the item parameters of two groups; rather, it is a shift in the distribution of ability along a secondary trait that influences the probability of a correct item response.
Abstract: Differential item functioning (DIF) has been informally conceptualized as multidimensionality. Recently, more formal descriptions of DIF as multidimensionality have become available in the item response theory literature. This approach assumes that DIF is not a difference in the item parameters of two groups; rather, it is a shift in the distribution of ability along a secondary trait that influences the probability of a correct item response. That is, one group is relatively more able on an ability such as test-wiseness. The parameters of the secondary distribution are confounded with item parameters by unidimensional DIF detection models, and this manifests as differences between estimated item parameters. However, DIF is confounded with impact in multidimensional tests, which may be a serious limitation of unidimensional detection methods in some situations. In the multidimensional approach, DIF is considered to be a function of the educational histories of the examinees. Thus, a better tool for unde...

88 citations


Journal ArticleDOI
TL;DR: The ordered partition model as mentioned in this paper is designed for a measurement context in which the categories of response to an item cannot be completely ordered, so that an examiner may want to maintain the distinction between the strategies.
Abstract: An item response model, called the ordered partition model, is designed for a measurement context in which the categories of response to an item cannot be completely ordered. For example, two different solution strategies may lead to an equivalent degree of success because both strategies may result in the same score, but an examiner may want to maintain the distinction between the strategies. Thus, the data would be neither nominal nor completely ordered, so they may not be suitable for other polytomous item response models such as the partial credit or the graded response models. The ordered partition model is described as an extension of the partial credit model, its relationship to other models is discussed, and two examples are presented.

75 citations


Journal ArticleDOI
TL;DR: In this article, a two-stage procedure for estimating item bias was examined with six indexes of item bias and with the Mantel-Haenszel (MH) statistic; the sample size, the number of biased items, and the magnitude of the bias were varied.
Abstract: A two-stage procedure for estimating item bias was examined with six indexes of item bias and with the Mantel-Haenszel (MH) statistic; the sample size, the number of biased items, and the magnitude of the bias were varied. The second stage of the procedure did not identify substantial numbers of false positives (unbiased items identified as biased). However, the identification of true positives in the second stage was useful only when the magnitude of the bias was not small and the number of biased items was large (20% or 40% of the test). The weighted indexes tended to identify more true and false positives than their unweighted item response theory counterparts. Finally, the MH statistic identified fewer false positives, but did not identify small bias as well as the item response theory indexes

73 citations
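For reference, the Mantel-Haenszel procedure used as the comparison method pools 2 x 2 tables across score strata. A sketch of the common odds ratio and the continuity-corrected chi-square; variable names and the stratification scheme are illustrative, not the study's implementation.

```python
import numpy as np

def mantel_haenszel_dif(correct, group, strata):
    """Mantel-Haenszel DIF statistics for one item.

    correct : 0/1 item responses
    group   : 0 = reference group, 1 = focal group
    strata  : matching variable (e.g., total test score)

    Returns the common odds-ratio estimate alpha_MH and the MH chi-square
    with the usual continuity correction."""
    correct, group, strata = map(np.asarray, (correct, group, strata))
    num = den = a_sum = ea_sum = va_sum = 0.0
    for s in np.unique(strata):
        m = strata == s
        A = np.sum((group[m] == 0) & (correct[m] == 1))  # reference correct
        B = np.sum((group[m] == 0) & (correct[m] == 0))  # reference incorrect
        C = np.sum((group[m] == 1) & (correct[m] == 1))  # focal correct
        D = np.sum((group[m] == 1) & (correct[m] == 0))  # focal incorrect
        T = A + B + C + D
        if T < 2:
            continue
        num += A * D / T
        den += B * C / T
        n_ref, n_foc = A + B, C + D
        n_right, n_wrong = A + C, B + D
        a_sum += A
        ea_sum += n_ref * n_right / T
        va_sum += n_ref * n_foc * n_right * n_wrong / (T ** 2 * (T - 1))
    alpha_mh = num / den
    chi2_mh = (abs(a_sum - ea_sum) - 0.5) ** 2 / va_sum
    return alpha_mh, chi2_mh
```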


Journal ArticleDOI
TL;DR: Bias in an observed variable Y as a measure of an unobserved variable W exists when the relationship of Y to W varies among populations of interest, as mentioned in this paper, and bias is often studied by examining...
Abstract: Measurement bias in an observed variable Y as a measure of an unobserved variable W exists when the relationship of Y to W varies among populations of interest. Bias is often studied by examining...

64 citations


Journal ArticleDOI
TL;DR: In this article, the authors demonstrate empirically how item bias indexes based on item response theory (IRT) identify bias that results from multidimensionality, when a test is multidimensional (MD) with a primary trait and a nuisance trait that affects a small portion of the test.
Abstract: This paper demonstrates empirically how item bias indexes based on item response theory (IRT) identify bias that results from multidimensionality. When a test is multidimensional (MD) with a primary trait and a nuisance trait that affects a small portion of the test, item bias is defined as a mean difference on the nuisance trait between two groups. Results from a simulation study showed that although IRT-based bias indexes clearly distinguished multidimensionality from item bias, even with the presence of a between-group difference on the primary trait, the bias detection rate depended on the degree to which the item measured the nuisance trait, the values of MD discrimination, and the number of MD items. It was speculated that bias defined from the MD perspective was more likely to be detected when the test data met the essential unidimensionality assumption.

62 citations


Journal ArticleDOI
TL;DR: In this article, the importance of regression diagnostics in detecting influential points is discussed, five statistics are recommended for the applied researcher, and the suggested diagnostics are used on a small dataset to detect an influential data point and analyze its effects.
Abstract: Influential data points can affect the results of a regression analysis; for example, the usual summary statistics and tests of significance may be misleading. The importance of regression diagnostics in detecting influential points is discussed, and five statistics are recommended for the applied researcher. The suggested diagnostics were used on a small dataset to detect an influential data point, and the effects were analyzed. Collinearity-based diagnostics also are discussed and illustrated on the same dataset. The non-robustness of the least squares estimates in the presence of influential points is emphasized. Diagnostics for multiple influential points, multivariate regression, multicollinearity, nonlinear regression, and other multivariate procedures also are discussed.
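The abstract recommends case-level diagnostics without listing formulas; the sketch below computes several standard ones (leverage, externally studentized residuals, Cook's distance, DFFITS) for ordinary least squares. These are common choices, not necessarily the five statistics recommended in the article.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Common influence diagnostics for ordinary least squares.

    X : n x p design matrix (include a column of ones for the intercept)
    y : response vector
    Returns leverages, externally studentized residuals, Cook's distances,
    and DFFITS."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
    h = np.diag(H)                                  # leverages
    mse = resid @ resid / (n - p)
    # deleted (externally studentized) residuals
    s2_i = ((n - p) * mse - resid**2 / (1 - h)) / (n - p - 1)
    t = resid / np.sqrt(s2_i * (1 - h))
    cooks = (resid**2 / (p * mse)) * (h / (1 - h)**2)
    dffits = t * np.sqrt(h / (1 - h))
    return h, t, cooks, dffits
```

Large leverages, |t| values above roughly 2, or Cook's distances that stand out from the rest flag cases worth inspecting before trusting the usual summary statistics.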

Journal ArticleDOI
TL;DR: In this article, the effect of response format on diagnostic assessment of students' performance on an algebra test was investigated using two diagnostic approaches: a "bug" analysis and a rule-space analysis.
Abstract: The effect of response format on diagnostic assessment of students' performance on an algebra test was investigated. Two sets of parallel, open-ended (OE) items and a set of multiple-choice (MC) items―which were stem-equivalent to one of the OE item sets―were compared using two diagnostic approaches: a "bug" analysis and a rule-space analysis. Items with identical format (parallel OE items) were more similar than items with different formats (OE vs. MC).

Journal ArticleDOI
TL;DR: In this paper, the similarity data were analyzed using a multidimensional scaling (MDS) procedure followed by a hierarchical cluster analysis of the MDS stimulus coordinates, and the results indicated a strong correspondence between similarity data and the arrangement of items as prescribed in the test blueprint.
Abstract: A new method for evaluating the content representation of a test is illustrated. Item similarity ratings were obtained from content domain experts in order to assess whether their ratings corresponded to item groupings specified in the test blueprint. Three expert judges rated the similarity of items on a 30-item multiple-choice test of study skills. The similarity data were analyzed using a multidimensional scaling (MDS) procedure followed by a hierarchical cluster analysis of the MDS stimulus coordinates. The results indicated a strong correspondence between the similarity data and the arrangement of items as prescribed in the test blueprint. The findings suggest that analyzing item similarity data with MDS and cluster analysis can provide substantive information pertaining to the content representation of a test. The advantages and disadvantages of using MDS and cluster analysis with item similarity data are discussed.
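A minimal sketch of the two-step analysis described: convert averaged similarity ratings to dissimilarities, scale them with nonmetric MDS, then cluster the resulting stimulus coordinates hierarchically. The library choices, dimensionality, and number of clusters are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.cluster.hierarchy import linkage, fcluster

def mds_then_cluster(similarity, n_dims=2, n_clusters=4):
    """similarity: items-x-items array of averaged judge ratings (higher =
    more similar). Returns MDS coordinates and hierarchical cluster labels."""
    dissimilarity = similarity.max() - similarity     # convert to distances
    np.fill_diagonal(dissimilarity, 0.0)
    mds = MDS(n_components=n_dims, dissimilarity="precomputed",
              metric=False, random_state=0)
    coords = mds.fit_transform(dissimilarity)
    tree = linkage(coords, method="ward")             # cluster the coordinates
    clusters = fcluster(tree, t=n_clusters, criterion="maxclust")
    return coords, clusters
```

The cluster labels can then be compared with the item groupings specified in the test blueprint to judge content representation.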

Journal ArticleDOI
TL;DR: In this paper, a single higher-order cluster analysis is used to group cluster mean profiles derived from several preliminary analyses, and the results are confirmed when each higher order cluster contains one clu...
Abstract: A single higher-order cluster analysis can be used to group cluster mean profiles derived from several preliminary analyses. Replication is confirmed when each higher-order cluster contains one clu...

Journal ArticleDOI
TL;DR: In this article, the effect of reviewing items and altering responses on the efficiency of computerized adaptive tests and the resultant ability estimates of examinees was explored, and the average efficiency of the test was decreased by 1% after review.
Abstract: The effect of reviewing items and altering responses on the efficiency of computerized adaptive tests and the resultant ability estimates of examinees were explored. 220 students were randomly assigned to a review condition; their test instructions indicated that each item must be answered when presented, but that the responses could be reviewed and altered at the end of the test. A sample of 492 students did not have the opportunity to review and alter responses. Within the review condition, examinee ability estimates before and after review were correlated .98. The average efficiency of the test was decreased by 1% after review. Approximately 32% of the examinees improved their ability estimates after review, but did not change their pass/fail status. Disallowing review on adaptive tests administered under these rules is not supported by these data.

Journal ArticleDOI
TL;DR: The derivations of several item selection algorithms for use in fitting test items to target information functions (IFs) are described, indicating that the algorithms provided reliable fit to the target in terms of item parameters, test information functions, and expected score distributions.
Abstract: The derivations of several item selection algorithms for use in fitting test items to target information functions (IFs) are described. These algorithms circumvent iterative solutions by using the criteria of moving averages of the distance to a target IF and by simultaneously considering an entire range of ability points used to condition the IFs. The algorithms were tested by generating six forms of an ACT math test, each fit to an existing target test, including content-designated item subsets. The results indicate that the algorithms provided reliable fit to the target in terms of item parameters, test information functions, and expected score distributions.
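The algorithms in the paper avoid iterative solutions with moving-average criteria; the simpler greedy sketch below conveys only the basic fit-to-target idea over a grid of ability points, using 3PL item information. It is not the authors' algorithm and ignores content constraints.

```python
import numpy as np

def item_information(theta, a, b, c):
    """Fisher information of a 3PL item at each theta value."""
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return (a**2) * ((p - c) / (1 - c))**2 * (1 - p) / p

def fit_to_target(pool, target, theta=np.linspace(-3, 3, 13), n_items=30):
    """Greedy selection: at each step add the pool item that brings the
    summed test information closest to the target IF on the theta grid.
    pool is a list of (a, b, c) tuples."""
    selected, test_info = [], np.zeros_like(theta)
    infos = [item_information(theta, *item) for item in pool]
    for _ in range(n_items):
        best, best_dist = None, np.inf
        for j, info in enumerate(infos):
            if j in selected:
                continue
            dist = np.mean(np.abs(target - (test_info + info)))
            if dist < best_dist:
                best, best_dist = j, dist
        selected.append(best)
        test_info += infos[best]
    return selected, test_info
```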

Journal ArticleDOI
TL;DR: In this article, a mathematical programming model for constructing tests with a prespecified test information function and a heuristic for assigning items to tests such that their information functions are equal play an important role in the methods.
Abstract: Methods are proposed for the construction of weakly parallel tests, that is, tests with the same test information function. A mathematical programming model for constructing tests with a prespecified test information function and a heuristic for assigning items to tests such that their information functions are equal play an important role in the methods. The MI and MIDI methods are proposed for constructing tests with a prespecified test information function applying the Minimax model. Similar methods, MAMI and MADI, are provided for construction of a weakly parallel test approximately equal with respect to the Maximin criterion. The four methods were applied to a real item bank of 600 items from college placement mathematics tests (520 items were from 13 previously administered American College Testing Assessment Program tests, and 80 were from the Collegiate Mathematics Placement Program). The numerical examples indicated that the tests were constructed quickly and that the heuristic gave good results. However, the heuristic was not applicable for every set of practical constraints (i.e., constraints with respect to test administration time, test composition, or dependencies between items). Four tables and four graphs present information about the constructed tests.

Journal ArticleDOI
TL;DR: In this article, a two-stage process that considers the multidimensionality of tests under the framework of unidimensional item response theory (IRT) is described and evaluated, and items are clustered in the first stage.
Abstract: A two-stage process that considers the multidimensionality of tests under the framework of unidimensional item response theory (IRT) is described and evaluated. In the first stage, items are clust...

Journal ArticleDOI
TL;DR: In this article, a computerized adaptive test based on the nominal response model (NR CAT) was implemented, its item pool requirements were examined, and its performance was compared with that of a CAT based on the three-parameter logistic (3PL) model.
Abstract: Although most computerized adaptive tests (CATs) use dichotomous item response theory (IRT) models, research on the use of polytomous IRT models in CAT has shown promising results. This study implemented a CAT based on the nominal response model (NR CAT). Item pool requirements for the NR CAT were examined. The performance of the NR CAT and a CAT based on the three-parameter logistic (3PL) model was compared. For two-, three-, and four-category items, items with maximum information of at least .16 produced reasonably accurate trait estimation for tests with a minimum test length of approximately 15 to 20 items. The NR CAT was able to produce trait estimates comparable to those of the 3PL CAT. Implications of these results are discussed.
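Because the item pool requirement is stated in terms of maximum item information, it may help to see the nominal response model and its information function, which equals the variance of the category slopes under the category probabilities. This is a sketch of Bock's nominal model in general form; the parameter values are made up, not taken from the study's pool.

```python
import numpy as np

def nominal_probs(theta, a, c):
    """Category probabilities under the nominal response model:
    P_k(theta) = exp(a_k * theta + c_k) / sum_v exp(a_v * theta + c_v)."""
    z = np.asarray(a) * theta + np.asarray(c)
    z -= z.max()                      # numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def nominal_information(theta, a, c):
    """Item information: I(theta) = sum_k P_k a_k^2 - (sum_k P_k a_k)^2."""
    p = nominal_probs(theta, a, c)
    a = np.asarray(a, float)
    return p @ a**2 - (p @ a) ** 2

# A hypothetical three-category item, evaluated at theta = 0
print(nominal_information(0.0, a=[-0.7, 0.0, 0.7], c=[0.0, 0.5, -0.5]))
```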

Journal ArticleDOI
TL;DR: In this article, the capability of DIMTEST in assessing essential unidimensionality of item responses to real tests was investigated; some test data were found to fit an essentially unidimensional model and others were not.
Abstract: The capability of DIMTEST in assessing essential unidimensionality of item responses to real tests was investigated. DIMTEST found that some test data fit an essentially unidimensional model and ot...

Journal ArticleDOI
TL;DR: In this article, an approximate statistical test is derived for the hypothesis that the intraclass reliability coefficients associated with two measurement procedures are equal, and control of Type 1 error is investigated by comparing empirical sampling distributions of the test statistic with its derived theoretical distribution.
Abstract: An approximate statistical test is derived for the hypothesis that the intraclass reliability coefficients associated with two measurement procedures are equal. Control of Type 1 error is investigated by comparing empirical sampling distributions of the test statistic with its derived theoretical distribution. A numerical illustration of the procedure is also presented.

Journal ArticleDOI
TL;DR: The direct-product model has been suggested as a procedure for estimating multiplicative effects of traits and methods in multitrait-multimethod matrices as discussed by the authors, which has been extended in two ways: first, hierarchically nested models are derived for explicitly testing the overall and specific patterns of method and trait factors.
Abstract: The direct-product model has been suggested as a procedure for estimating multiplicative effects of traits and methods in multitrait-multimethod matrices. Research on the direct-product model is extended in two ways. First, hierarchically nested models are derived for explicitly testing the overall and specific patterns of method and trait factors. Second, formal tests are developed for the pattern of communalities. These procedures are illustrated with data from Lawler (1967)

Journal ArticleDOI
TL;DR: The IRTDIF program was written in IBM Professional FORTRAN for IBM and compatible personal computers and uses subroutines taken from Numerical Recipes to compute the percentage points of the incomplete gamma functions.
Abstract: IRT models. To compute the DIF measures and the statistics to test the significance of the DIF measures, IRTDIF uses two files. One file contains sets of item parameter estimates; the other contains the sampling variance-covariance matrices. Significance levels (p values) are provided for Lord's χ² and the exact area measures. When the sampling variance-covariance matrices are not available, the exact and closed-interval area measures are provided without statistical significance tests. The program was written in IBM Professional FORTRAN for IBM and compatible personal computers and uses subroutines taken from Numerical Recipes (Press, Flannery, Teukolsky, & Vetterling, 1986) to compute the percentage points of the incomplete gamma functions. Execution of the program requires a numerical coprocessor.

Journal ArticleDOI
TL;DR: DIMTEST tests the hypothesis that the model underlying a matrix of binary item responses, generated by administering a test to a specific examinee population, is essentially unidimensional.
Abstract: DIMTEST is a statistical test developed by Stout (1987), and refined by Nandakumar & Stout (in press; see also Nandakumar, in press). DIMTEST tests the hypothesis that the model underlying a matrix of binary item responses, generated by administering a test to a specific examinee population, is essentially unidimensional (essential dimensionality is a mathematical formulation of the existence of one dominant latent dimension).

Journal ArticleDOI
TL;DR: To estimate test reliability and to create parallel tests, test items frequently are matched and algorithms are presented based on optimization theory in networks (graphs) and have polynomial complexity.
Abstract: To estimate test reliability and to create parallel tests, test items frequently are matched. Items can be matched by splitting tests into parallel test halves, by creating T splits, or by matching a desired test form. Problems often occur. Algorithms are presented to solve these problems. The algorithms are based on optimization theory in networks (graphs) and have polynomial complexity. Computational results from solving sample problems with several hundred decision variables are reported

Journal ArticleDOI
TL;DR: A flexible data analysis approach is proposed that combines two psychometric procedures— seriation and multidimensional scaling (MDS) that is particularly appropriate for the analysis of proximities containing temporal information.
Abstract: A number of model-based scaling methods have been developed that apply to asymmetric proximity matrices. A flexible data analysis approach is proposed that combines two psychometric procedures—seriation and multidimensional scaling (MDS). The method uses seriation to define an empirical ordering of the stimuli, and then uses MDS to scale the two separate triangles of the proximity matrix defined by this ordering. The MDS solution contains directed distances, which define an "extra" dimension that would not otherwise be portrayed, because the dimension comes from relations between the two triangles rather than within triangles. The method is particularly appropriate for the analysis of proximities containing temporal information. A major difficulty is the computational intensity of existing seriation algorithms, which is handled by defining a nonmetric seriation algorithm that requires only one complete iteration. The procedure is illustrated using a matrix of co-citations between recent presidents o...

Journal ArticleDOI
TL;DR: In this article, the authors investigated the effects of the item response theory (IRT) model and test length on the distribution of three appropriateness indexes and their cutoff values at three false positive rates.
Abstract: The extent to which three appropriateness indexes - Z3, ECIZ4, and W (a variation of Wright's person-fit statistic) - are well-standardized was investigated in a Monte Carlo study. To assess the effects of the item response theory (IRT) model and test length on the distribution of the indexes and their cutoff values at three false positive rates, nonaberrant response patterns were generated. ECIZ4 most closely approximated a normal distribution, showing less skewness and kurtosis than Z3 and W. The ECIZ4 cutoff values were affected less by test length and the IRT model than were those of Z3 and W. In contrast, the distribution of W was the least stable over replications, and its cutoff values varied greatly depending on the IRT model and test length.
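Indexes such as Z3 are closely related to the standardized log-likelihood person-fit statistic; a sketch of that general form is given below. It is not the exact definition of Z3, ECIZ4, or W as compared in the study.

```python
import numpy as np

def lz_person_fit(responses, p):
    """Standardized log-likelihood person-fit index for one examinee.

    responses : 0/1 vector of item responses
    p         : model probabilities of a correct response at the examinee's
                estimated theta
    Large negative values flag aberrant (unexpected) response patterns."""
    u, p = np.asarray(responses, float), np.asarray(p, float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))          # observed log-likelihood
    expected = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))    # its expectation
    variance = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)     # its variance
    return (l0 - expected) / np.sqrt(variance)
```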

Journal ArticleDOI
TL;DR: In this article, an external measure can replace one of the raters so that individual reliabilities of two independent raters can be estimated; in a somewhat similar fashion, estimates of treatment effects present in ratings by two independent raters can provide the external frame of reference against which differences in their individual reliabilities can be evaluated.
Abstract: Rating scales have no inherent reliability that is independent of the observers who use them. The often reported interrater reliability is an average of perhaps quite different individual rater reliabilities. It is possible to separate out the individual rater reliabilities given a number of independent raters who observe the same sample of ratees. Under certain assumptions, an external measure can replace one of the raters, and individual reliabilities of two independent raters can be estimated. In a somewhat similar fashion, estimates of treatment effects present in ratings by two independent raters can provide the external frame of reference against which differences in their individual reliabilities can be evaluated. Models for estimating individual rater reliabilities are provided for use in selecting, evaluating, and training participants in clinical research.

Journal ArticleDOI
TL;DR: The extreme groups research strategy is a two-stage measurement procedure that may be employed when it is relatively simple and inexpensive to obtain data on a psychological variable (X) in the first stage of investigation, but it is quite complex and expensive to measure subsequently a second variable (Y), as discussed by the authors.
Abstract: The extreme groups research strategy is a two-stage measurement procedure that may be employed when it is relatively simple and inexpensive to obtain data on a psychological variable (X) in the first stage of investigation, but it is quite complex and expensive to measure subsequently a second variable (Y). This strategy is related to the selection of upper and lower groups for item discrimination analysis (Kelley, 1939) and to the treatments x blocks design in which participants are first "blocked" on the X variable and then only the extreme (highest and lowest means) blocks are compared on the Y variable, usually by a t test or an analysis of variance. Feldt (1961) showed analytically that if the population correlation coefficient between X and Y is ρ = .10, the power of the t test is maximized if each extreme group consists of 27% of the population tested on the X variable. However, Feldt's derivation assumes that the X and Y variables are normally distributed. The present study employed a Monte Car...
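Feldt's 27% result is easy to probe with a small Monte Carlo sketch like the one below: draw correlated (X, Y) pairs, keep the extreme groups on X, and record how often the t test on Y rejects. The sample size, replication count, and distributions here are illustrative assumptions, not those of the study.

```python
import numpy as np
from scipy import stats

def extreme_groups_power(rho=0.10, prop=0.27, n=500, reps=2000, alpha=0.05, seed=0):
    """Estimate the power of the extreme-groups t test for a given
    selection proportion; with bivariate normal data, power should peak
    near prop = .27, as in Feldt (1961)."""
    rng = np.random.default_rng(seed)
    k = int(round(prop * n))
    hits = 0
    for _ in range(reps):
        x = rng.standard_normal(n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # corr(X, Y) = rho
        order = np.argsort(x)
        low, high = y[order[:k]], y[order[-k:]]       # bottom and top groups on X
        if stats.ttest_ind(high, low).pvalue < alpha:
            hits += 1
    return hits / reps

print(extreme_groups_power(prop=0.27), extreme_groups_power(prop=0.10))
```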

Journal ArticleDOI
TL;DR: In these studies, two joint maximum likelihood estimation methods (LOGIST 2B and LOGIST 5) and two marginal maximum likelihood estimations methods (BILOG and ForScore) were contrasted by measuring the difference between a simulation model and a model obtained by applying an estimation method to simulation data.
Abstract: Two psychometric models with very different parametric formulas and item response functions can make virtually the same predictions in all applications. By applying some basic results from the theory of hypothesis testing and from signal detection theory, the power of the most powerful test for distinguishing the models can be computed. Measuring model misspecification by computing the power of the most powerful test is proposed. If the power of the most powerful test is low, then the two models will make nearly the same prediction in every application. If the power is high, there will be applications in which the models will make different predictions. This measure, that is, the power of the most powerful test, places various types of model misspecification—item parameter estimation error, multidimensionality, local independence failure, learning and/or fatigue during testing—on a common scale. The theory supporting the method is presented and illustrated with a systematic study of misspecifica tion...