This paper describes four commonly used designs in equating test scores. These designs are: (1) single-group; (2) random-group; (3) equivalent-group; and (4) anchor-test. Each design requires that its data be collected according to specific guidelines. Three of the four methods are illustrated through hypothetical examples. All four methods try to equate test scores from equally reliable and parallel measures. Although the anchor-test design is not as simple to implement as the other designs,...

This paper studies whether equating results can be improved if the variable that accounts for all systematic differences between equating populations is identified and used as an anchor in anchor test design or as a variable on which to match equating samples. The sample invariant properties of four anchor test equating methods (Tucker and Levine equally reliable linear models, chained equipercentile, and frequency estimation equipercentile models) were examined under representative,...

Test disclosure legislation in New York State (LaValle Act) has had a major impact on the national testing programs administered by Educational Testing Service (ETS) for various sponsoring organizations. The paper reviews the immediate operational effects of test disclosure in the following areas: (1) increase in number of test forms developed; (2) acceleration of development of new equating methods; and (3) filing requirements and interpretive materials. Possible future changes in national...

This paper discusses loglinear models for assessing differential item functioning (DIF). Loglinear and logit models that have been suggested for studying DIF are reviewed, and loglinear formulations of the logit models are given. A polynomial loglinear model for assessing DIF is introduced. Two examples using the polynomial loglinear model for investigating DIF are discussed. One example investigates DIF for a test consisting of both dichotomous and polytomous items. Another example illustrates...

This paper discusses the four major types of test equating: (1) mean; (2) linear; (3) equipercentile; and (4) item response theory. The single-group, equivalent-group, and anchor-test data collection designs are presented as methods used for test equating. Issues related to assumptions and equating error are also addressed. The advantages and disadvantages of each equating method are discussed along with the conditions conducive to satisfactory equating. Research on the current interest in...
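The first two equating types named above can be sketched in a few lines. This is a minimal illustration, assuming an equivalent-groups design in which the form X and form Y score samples come from comparable groups; the function names are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift x by the difference between the form means."""
    return x - mean(scores_x) + mean(scores_y)


def linear_equate(x, scores_x, scores_y):
    """Linear equating: map x to the form-Y score with the same z-score."""
    mx, my = mean(scores_x), mean(scores_y)
    sx, sy = pstdev(scores_x), pstdev(scores_y)
    return my + (sy / sx) * (x - mx)


if __name__ == "__main__":
    form_x = [10, 12, 14, 16, 18]
    form_y = [20, 24, 28, 32, 36]
    print(mean_equate(16, form_x, form_y))
    print(linear_equate(16, form_x, form_y))
```

Mean equating adjusts only for a difference in average difficulty, while linear equating also adjusts for a difference in score spread.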

The purpose of this study was to demonstrate that the choice of sample weights when defining the target population under poststratification equating can be a critical factor in determining the accuracy of the equating results under a unique equating scenario, known as "rater comparability scoring and equating." The nature of data collection under "rater comparability scoring" is such that it results in a very high correlation between the anchor and total score in the new...

The quality of nonequivalent group equating by the one-parameter hierarchical generalized linear logistic model (1-P HGLLM) was examined by comparing it with: (1) traditional concurrent equating; (2) Stocking-Lord's method; and (3) multiple-group concurrent equating. Root mean squared errors (RMSEs) for item parameters indicated that there was no prominent difference among the four equating methods, and none of the four methods was consistently better than other methods across the entire item...

There are test-equating situations in which it may be appropriate to fit a loglinear or other type of probability model to the joint distribution of a total score on a test and a score on part of that test. For anchor test designs, this situation arises for internal anchor tests, which are embedded within the total test. Similarly, a part-whole relationship arises between two scores when a few test items are dropped from a test and a single group design is used to equate the scores of the full...

The purpose of this study is to determine the extent of scale drift on a test that employs cut scores. It is essential to examine scale drift in a testing program using new forms that are often put on scale through a series of intermediate equatings (known as equating chains). This may cause equating error to accumulate to a point where scale scores are rendered incomparable across two parallel chains or time periods. The study examined whether scale drift occurred for two conditions (i.e.,...

A calibration of the Armed Forces Qualification Test (AFQT) composite of the Armed Services Vocational Aptitude Battery (ASVAB) Forms 8a, 8b, 9a, 9b, 10a, and 10b to the metric of the AFQT Form 7a (AFQT-7a) and a comparison of these outcomes to the operational calibration tables implemented 1 October 1980 are presented. A sample of applicants for military enlistment was administered one form of ASVAB and the AFQT-7a in counterbalanced order. For analytic purposes, an edited sample (15,115...

This paper presents a new equating method for the nonequivalent groups with anchor test design: poststratification equating based on true anchor scores. The linear version of this method is shown to be equivalent, under certain conditions, to Levine observed score equating, in the same way that the linear version of poststratification equating is equivalent to Tucker equating. Some issues related to this result are discussed.

In this paper, the "standard error of equating difference" (SEED) is described in terms of originally proposed kernel equating functions (von Davier, Holland, & Thayer, 2004) and extended to incorporate traditional linear and equipercentile functions. These derivations expand on prior developments of SEEDs and standard errors of equating and provide additional insight about the relationships of kernel and traditional equating functions. Simulations are used to evaluate the SEEDs'...

The "single group with nearly equivalent tests" (SiGNET) design proposed here was developed to address the problem of equating scores on multiple-choice test forms with very small single-administration samples. In this design, most of the items in each new test form come from the previous form, and the new items are those that were administered as unscored items in the previous form. Each form is equated using data from examinees who took the previous form. As the equating is a...

In this paper, we develop a new chained equipercentile equating procedure for the nonequivalent groups with anchor test (NEAT) design under the assumptions of the classical test theory model. This new equating is named chained true score equipercentile equating. We also apply the kernel equating framework to this equating design, resulting in a family of chained true score equipercentile equating functions, which include the Levine true score equating model as a special case. (Contains 2...

This research derives simplified formulas for computing the standard error of the frequency estimation method for equating score distributions that are continuized using a uniform or Gaussian kernel function (P. W. Holland, B. F. King, and D. T. Thayer, 1989; Holland and Thayer, 1987). The simplified formulas are applicable to equating both the observed- and smoothed-score distributions (P. R. Rosenbaum and D. Thayer, 1987). Two empirical studies investigated the use of the simplified formulas....

The equating of scores on alternate forms of different achievement tests through the use of the three-parameter latent trait model, item-response theory (IRT) equating, was compared with the results of score equatings based on conventional linear and curvilinear equating models. Ten equatings were completed for pairs of alternate forms of the Advanced Placement Program, which measures different content areas and traits in each subject area. It was found that despite the apparent violation of...

The purposes of this paper are fivefold: to discuss (1) when item response theory (IRT) equating methods should provide better results than traditional methods; (2) which IRT model, the three-parameter logistic or the one-parameter logistic (Rasch), is the most reasonable to use; (3) what unique contributions IRT methods can offer the equating process; (4) what work has been done that relates to the confidence that can be placed in the IRT equating results; and (5) what unresolved issues exist...

A regression procedure is developed to link simultaneously a very large number of item response theory (IRT) parameter estimates obtained from a large number of test forms, where each form has been separately calibrated and where forms can be linked on a pairwise basis by means of common items. An application is made to forms in which a two-parameter logistic model is applied to dichotomous items and a general partial credit model is applied to polytomous items.

Based on the experiences of four equating studies conducted by the Austin (Texas) Independent School District, a practical "cookbook" approach to test equating is presented. Three types of equating procedures are discussed: choosing a cutoff score on a new instrument, predicting Y from X, and symmetric equating of X and Y. (BW)

The feasibility of using linear and equipercentile equating methods (W. H. Angoff, 1984) to equate forms of the Test of Written English (TWE) by using the Test of English as a Foreign Language (TOEFL) as an anchor was explored. These two equating methods assume that either the TOEFL test and TWE test measure the same skills or that the examinee groups across TWE administrations are equivalent in skills. The differences between equated and observed scores (equating residuals) and differences...

Recently, the literature has seen increasing interest in subscores for their potential diagnostic values; for example, one study suggested the report of weighted averages of a subscore and the total score, whereas others showed, for various operational and simulated data sets, that weighted averages, as compared to subscores, lead to more accurate diagnostic information. To report weighted averages, the averages should be comparable across different test forms; that is, the averages should be...

The purpose of this paper is to extend von Davier, Holland, and Thayer's (2004b) framework of kernel equating so that it can incorporate raw data and traditional equipercentile equating methods. One result of this more general framework is that previous equating methodology research can be viewed more comprehensively. Another result is that the standard error of equated score difference (SEED) has a wider application than originally proposed. The methods described in this paper are empirically...

Continuous exponential families may be employed to find continuous distributions with the same initial moments as the discrete distributions encountered in typical applications of classical equating. These continuous distributions provide distribution functions and quantile functions that may be employed in equating. To illustrate, an application is considered for a randomly equivalent groups design.

Nine statistical strategies for selecting equating functions in an equivalent groups design were evaluated. The strategies of interest were likelihood ratio chi-square tests, regression tests, Kolmogorov-Smirnov tests, and significance tests for equated score differences. The most accurate strategies in the study were the likelihood ratio tests and the significance tests for equated score differences.

Section Pre-Equating (SPE) is a method used to equate test forms that consist of multiple separately timed sections. SPE does not require examinees to take two complete forms of the test. Instead, all of the old form and one or two sections of the new form are administered to each examinee, and missing data techniques are employed to estimate the necessary equating parameters. When a test includes only one variable section, there is no simple way to obtain an estimate of the correlation between...

Because the demand for subscores is ever increasing, this study examined two different approaches for equating subscores: (a) equating a subscore on the new form to the same subscore in the old form using internal common items as the anchor to conduct the equating, and (b) equating a subscore on the new form to the same subscore in the old form using equated total scores as the anchor to conduct the equating. Equated total scores can be used as an anchor to equate the subscores because the...

Procedures used to compare the results from item response theory as well as more traditional equating methods were described and critically analyzed. The implications of the comparison of equipercentile, linear, one-parameter (Rasch), and three-parameter methods for equating twelve forms of each of the five tests of General Educational Development (GED) were discussed. The use of factor analyses to assess test dimensionality, examination of equating curves, examination of item parameter...

This study investigates the population sensitivity of the commonly used linear equating methods in the Non-Equivalent-groups with an Anchor Test (NEAT) design: the Tucker, the Levine observed-score, and the chain linear methods. For a detailed analysis of the subject, we apply three distinctive approaches to a real data set from a NEAT design: a) the RMSD index for the NEAT design of von Davier, Holland, and Thayer (2004); b) the parallel-linking system of Dorans and Holland (2000), and c) the...

The Non-Equivalent-groups Anchor Test (NEAT) design involves two populations, "P" and "Q," of test takers and makes use of an anchor test to link them. Two observed-score equating methods used for NEAT designs are those based on chain equating and those using the anchor to poststratify the distributions of the two operational test scores to a common population, i.e., Tucker equating and frequency estimation. This paper introduced a method that can be used in the NEAT design...

This paper suggests two new, related methods for estimating a test-score equating relationship from small samples of test takers. These methods do not require the estimated equating transformation to be linear. Instead, they constrain the estimated equating curve to pass through 2 prespecified end-points and a middle point determined from the data. Some preliminary results indicate that these methods outperform mean equating and other methods used for equating in small samples.
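The abstract does not say what shape the constrained curve takes. One simple choice, assumed here purely for illustration, is a circular arc through the two end-points and the middle point. The function name and the example points are hypothetical.

```python
from math import sqrt


def circle_arc_equate(x, low, mid, high):
    """Equate x along the circular arc through three (x, y) points.

    low and high are the prespecified end-points; mid is the
    data-determined middle point. Falls back to a straight line
    when the three points are collinear.
    """
    (x1, y1), (x2, y2), (x3, y3) = low, mid, high
    d = 2.0 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        # Collinear points: the "arc" degenerates to the line through them.
        return y1 + (y3 - y1) * (x - x1) / (x3 - x1)
    s1, s2, s3 = x1 * x1 + y1 * y1, x2 * x2 + y2 * y2, x3 * x3 + y3 * y3
    # Circumcenter of the triangle formed by the three points.
    xc = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    yc = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    r2 = (x1 - xc) ** 2 + (y1 - yc) ** 2
    root = sqrt(max(r2 - (x - xc) ** 2, 0.0))
    # Pick the half of the circle on which the middle point lies.
    return yc + root if y2 > yc else yc - root


if __name__ == "__main__":
    print(circle_arc_equate(25, (0, 0), (50, 52), (100, 100)))
```

Because only the middle point is estimated from data, a curve of this kind has a single free parameter, which is what makes the approach attractive for small samples.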

Item response theory (IRT) item parameters can be estimated using data from a common item equating design either separately for each form or concurrently across forms. This paper reports the results of a simulation study of separate versus concurrent item parameter estimation. Using simulated data from a test with 60 dichotomous items, 4 factors were considered: (1) program (MULTILOG versus BILOG-MG); (2) sample size per form (3,000 versus 1,000); (3) number of common items (20 versus 10); and...

The psychometric characteristics of the Test of Written English (TWE) rating scale were explored. Rasch model scalar analysis methodology was employed with more than 4,000 scored essays across 2 elicitation prompts to gather information about the rating scale and rating process. Results suggested that the intervals between TWE scale steps were surprisingly uniform and that the size of the intervals was appropriately larger than the error associated with assignment of individual ratings. The...

The metric of the multidimensional item response theory (MIRT) item parameter estimates is usually referred to reference axes that are orthogonal and of unit length. This is because most MIRT parameter estimation programs solve the identification problem by requiring that multidimensional abilities follow a multivariate normal distribution, N(0, I). Under this circumstance, the equated group's reference system can be transformed into the base group's reference system...

This study used real data to construct testing conditions for comparing results of chained linear, Tucker, and Levine-observed score equatings. The comparisons were made under conditions where the new- and old-form samples were similar in ability and when they differed in ability. The length of the anchor test was also varied to enable examination of its effect on the three different equating methods. Two tests were used in the study, and the three equating methods were compared to a criterion...

A special case of examinee choice, the Optional Essay Problem, is examined from the point of view of test equating. The Optional Essay Problem involves equating essay scores when the examinees are required to select an optional essay topic from a list of topics in addition to taking a mandatory test required of all examinees. The conditions that must be satisfied if the null hypothesis of equal difficulty of the essays holds true are derived. If this hypothesis, called "Livingston's Null...

Large sample standard errors for the Tucker method of linear equating under the common item nonrandom groups design are derived under normality assumptions as well as under less restrictive assumptions. Standard errors of Tucker equating are estimated using the bootstrap method described by Efron. The results from different methods are compared via a computer simulation as well as a real data example based on test forms from a professional certification testing program. (Author/PN)
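A bootstrap standard error of this kind can be sketched as follows. This is a minimal illustration, assuming the textbook form of Tucker linear equating with synthetic-population weights w1 and w2 = 1 - w1; the function names, the default weight, and the number of replications are assumptions, not details from the paper.

```python
import random
from statistics import mean, pvariance


def _cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def tucker_equate(x, group1, group2, w1=0.5):
    """Tucker linear equating of a form-X score x to the form-Y scale.

    group1 holds (X, V) pairs and group2 holds (Y, V) pairs,
    where V is the common-item (anchor) score.
    """
    xs, v1 = zip(*group1)
    ys, v2 = zip(*group2)
    w2 = 1.0 - w1
    g1 = _cov(xs, v1) / pvariance(v1)  # slope of X on V in group 1
    g2 = _cov(ys, v2) / pvariance(v2)  # slope of Y on V in group 2
    dmu = mean(v1) - mean(v2)
    dvar = pvariance(v1) - pvariance(v2)
    # Synthetic-population means and variances.
    mu_x = mean(xs) - w2 * g1 * dmu
    mu_y = mean(ys) + w1 * g2 * dmu
    var_x = pvariance(xs) - w2 * g1 ** 2 * dvar + w1 * w2 * g1 ** 2 * dmu ** 2
    var_y = pvariance(ys) + w1 * g2 ** 2 * dvar + w1 * w2 * g2 ** 2 * dmu ** 2
    return mu_y + (var_y / var_x) ** 0.5 * (x - mu_x)


def bootstrap_se(x, group1, group2, reps=200, seed=0):
    """Bootstrap standard error of the Tucker equated score at x."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Resample examinees with replacement within each group.
        b1 = [rng.choice(group1) for _ in group1]
        b2 = [rng.choice(group2) for _ in group2]
        estimates.append(tucker_equate(x, b1, b2))
    m = mean(estimates)
    return (sum((e - m) ** 2 for e in estimates) / (reps - 1)) ** 0.5
```

The sketch resamples whole examinees so that each total score stays paired with its anchor score. With very small groups a resample can have zero anchor variance, so this version is only suitable for samples of reasonable size.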

There are a number of practical situations in which it would be desirable to be able to use the results of the administration of one assessment to estimate what the results would have been if another assessment had been administered. Test linking refers to the idea that results obtained from the administration of one test might be used to infer what the results would have been if another test had been used. Common knowledge, based on widespread experience with educational testing in the...

This paper addresses issues of vertical equating for the Arkansas Comprehensive Testing, Assessment and Accountability Program (ACTAAP) assessments as they relate to school accountability and determination of Adequate Yearly Progress (AYP) as required by the recent federal legislation, the No Child Left Behind Act. The paper first provides a brief statement of the problem, followed by a review of the testing in Arkansas, and related policies. It also examines some of the issues raised by this...

Kernel equating is a method of equating test scores devised by P. W. Holland and D. T. Thayer (1989). It takes its name from kernel smoothing, a process of smoothing a function by replacing each discrete value with a frequency distribution. It can be used when scores on two forms of a test are to be equated directly or when they are to be equated through a common anchor. The discrete score distributions are replaced with continuous distributions, and then equating is done with the continuous...
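The continuization step can be sketched with a Gaussian kernel. This is a minimal illustration using the commonly presented form in which a shrink factor a keeps the mean and variance of the continuous distribution equal to those of the discrete one; the bandwidth default h=0.6, the function names, and the bisection search are assumptions made for this example.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def continuized_cdf(x, scores, probs, h):
    """Gaussian-kernel continuization of a discrete score distribution."""
    mu = sum(p * s for s, p in zip(scores, probs))
    var = sum(p * (s - mu) ** 2 for s, p in zip(scores, probs))
    a = sqrt(var / (var + h * h))  # shrink factor preserving the variance
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mu) / (a * h))
        for s, p in zip(scores, probs)
    )


def kernel_equate(x, scores_x, probs_x, scores_y, probs_y, h=0.6):
    """Map x to the form-Y value with the same continuized CDF value."""
    target = continuized_cdf(x, scores_x, probs_x, h)
    span = max(scores_y) - min(scores_y) + 1
    lo, hi = min(scores_y) - 10 * span, max(scores_y) + 10 * span
    for _ in range(100):  # bisection on the monotone continuized CDF
        mid = (lo + hi) / 2.0
        if continuized_cdf(mid, scores_y, probs_y, h) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    probs = [0.25, 0.5, 0.25]
    print(kernel_equate(1.5, [0, 1, 2], probs, [3, 4, 5], probs))
```

A small h keeps the continuous distribution close to the discrete one, while a large h pushes it toward a normal shape, so the equating function moves toward a linear one.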

The term "equating" refers to a statistical procedure that adjusts test scores on different forms of the same examination so that scores can be interpreted interchangeably. This study examines the impact of equating with fewer items than originally planned when items have been removed from the equating set for a variety of reasons. A real data set from a licensure/certification examination was used for the study. Three linear equating methods and three test forms were involved. The...

This study addressed 2 issues of using loglinear models for smoothing univariate test score distributions and for enhancing the stability of equipercentile equating functions. One issue was a comparative assessment of several statistical strategies that have been proposed for selecting 1 from several competing model parameterizations. Another issue was an evaluation of the influence of the selection strategies on equating function accuracy. These issues were considered in a simulation study,...

The primary objective of this study was to find the smallest sample size for which equating based on a random groups design could be expected to result in less overall equating error than had no equating been conducted. Mean, linear, and equipercentile equating methods were considered. Some of the analyses presented in this paper assumed that the test scores were normally distributed. Other analyses were not based on this assumption. Real test data were used to check whether the theoretical...

In 1955, R. Levine introduced two linear equating procedures for the common-item non-equivalent populations design. His procedures make the same assumptions about true scores; they differ in terms of the nature of the equating function used. In this paper, two parameterizations of a classical congeneric model are introduced to model the variables in the Levine procedures for the external and internal anchor cases. The models differ in the constraints imposed on certain effective test length...

The logic and uses of test equating are discussed, including three methods of test equating. The focus is on the conceptual underpinnings of each test equating method, rather than on the mathematics of the procedures. Additional consideration is given to the assumptions of each method and its respective strengths and weaknesses. A commonly accepted definition of equivalent scores is based on the concept of equipercentile equating. The first step is to determine the percentile ranks of the...
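The equipercentile idea can be sketched as follows. This is a minimal illustration, assuming the midpoint convention for percentile ranks and linear interpolation between observed form-Y scores; the function names are hypothetical, and operational procedures typically add smoothing.

```python
def percentile_rank(x, scores):
    """Percent of scores below x plus half the percent exactly at x."""
    n = len(scores)
    below = sum(1 for s in scores if s < x)
    at = sum(1 for s in scores if s == x)
    return 100.0 * (below + 0.5 * at) / n


def equipercentile_equate(x, scores_x, scores_y):
    """Find the form-Y score whose percentile rank matches that of x."""
    target = percentile_rank(x, scores_x)
    points = sorted(set(scores_y))
    ranks = [percentile_rank(y, scores_y) for y in points]
    # Clamp to the observed form-Y range at the extremes.
    if target <= ranks[0]:
        return float(points[0])
    if target >= ranks[-1]:
        return float(points[-1])
    # Interpolate between the two form-Y scores that bracket the target rank.
    for i in range(1, len(points)):
        if target <= ranks[i]:
            frac = (target - ranks[i - 1]) / (ranks[i] - ranks[i - 1])
            return points[i - 1] + frac * (points[i] - points[i - 1])


if __name__ == "__main__":
    form_x = [1, 2, 3, 4, 5]
    form_y = [2, 4, 6, 8, 10]
    print(equipercentile_equate(3, form_x, form_y))
```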

The effectiveness of smoothing in reducing random errors in equipercentile equating was examined for a short writing assessment with two raters, two prompts, and scores ranging from zero to five. Thirteen methods were examined: no equating, three presmoothing, three postsmoothing, three combination presmoothing and postsmoothing, mean equating, linear equating, and unsmoothed equipercentile. The data for the study resulted from simulations of a writing assessment with one and two raters used...

A formula is derived for the asymptotic standard error of a true-score equating by item response theory (IRT). The equating method is applicable when the two tests to be equated are administered to different groups along with an "anchor test." Numerical standard errors are shown for an actual equating, (1) comparing the standard errors of IRT, linear, and equipercentile methods and (2) illustrating the effect of the length of the anchor test on the standard error of the equating. (Author/BW)

The purpose of this study was to empirically examine the relationship between violations of the assumption of unidimensionality, as assessed by the factor analysis of item parcel data, and the quality of item response theory (IRT) true-score equating, as measured by score scale stability. The verbal section of the Scholastic Aptitude Test (SAT) and the College Board Mathematics Level II examination were selected for use. Factor analyses were performed on each of the six selected test forms,...

Two recent simulation studies were conducted to aid in the diagnosis and interpretation of equating differences found between random and matched (nonrandom) samples for four commonly used equating procedures: (1) Tucker; (2) Levine equally reliable; (3) chained equipercentile observed-score; and (4) three-parameter, item response theory true-score equating. For logistical reasons, test forms in these simulations were equated to themselves, a situation that does not reflect reality. In the current...

Pseudo Bayes probability estimates are weighted averages of raw and modeled probabilities; these estimates have been studied primarily in nonpsychometric contexts. The purpose of this study was to evaluate pseudo Bayes probability estimates as applied to the estimation of psychometric test score distributions and chained equipercentile equating functions. Population test score distributions were created from actual test data and random samples of varied size were drawn from the populations....
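The weighted-average idea can be sketched as follows. This is a minimal illustration in which the weight on the raw proportions is n / (n + k), with k supplied by the caller; the parameter k, the function name, and the example numbers are assumptions for this sketch, and the study may determine the weights differently.

```python
def pseudo_bayes(counts, model_probs, k):
    """Blend raw proportions with modeled probabilities.

    counts: observed frequency at each score point.
    model_probs: modeled (for example, smoothed) probability at each score point.
    k: weight given to the model, on the scale of a sample size (assumed input).
    """
    n = sum(counts)
    w = n / (n + k)  # weight on the raw proportions
    return [w * c / n + (1.0 - w) * m for c, m in zip(counts, model_probs)]


if __name__ == "__main__":
    print(pseudo_bayes([2, 6, 2], [0.25, 0.5, 0.25], k=10))
```

With k = 0 the estimates equal the raw proportions, and as k grows they approach the modeled probabilities.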

Equating of tests composed of both discrete and passage-based items using the nonequivalent groups with anchor test (NEAT) design is popular in practice. This study investigated the impact of discrete anchor items and passage-based anchor items on observed score equating via simulation. Results suggested that an anchor with a larger proportion of passage-based items and/or a larger degree of local dependence among passage-based items produces larger equating errors, especially when group...

