Skip to main content

Full text of "ERIC EJ1111212: Development of Approximations to Population Invariance Indices. Research Report. ETS RR-08-36"

See other formats


Research Report 


Development of 
Approximations to 
Population Invariance Indices 


J inghua Liu 
XiaowenZhu 


J uly 2008 

ETS RR-08-36 



Listening. Learning. Leading.® 



Development of Approximations to Population Invariance Indices 


Jinghua Liu 
ETS, Princeton, NJ 

Xiaowen Zhu 

University of Pittsburg, PA 


July 2008 



As part of its educational and social mission and in fulfilling the organization's nonprofit charter 
and bylaws, ETS has and continues to learn from and also to lead research that furthers 
educational and measurement research to advance quality and equity in education and assessment 
for all users of the organization's products and services. 

ETS Research Reports provide preliminary and limited dissemination of ETS research prior to 
publication. To obtain a PDF or a print copy of a report, please visit: 

http://www.ets.org/research/contact.html 


Copyright © 2008 by Educational Testing Service. All rights reserved. 

ETS, the ETS logo, and LISTENING. LEARNING. 
LEADING, are registered trademarks of Educational Testing 
Service (ETS). ADVANCED PLACEMENT PROGRAM, 

AP, and SAT are registered trademarks of the College Board. 
PSAT/NMSQT is a registered trademark of the College Board 
and the National Merit Scholarship Corporation. 


ETS 





Abstract 


The purpose of this paper is to explore methods to approximate population invariance without 
conducting multiple linkings for subpopulations. Under the single group or equivalent groups 
design, no linking needs to be performed for the parallel-linear system linking functions. The 
unequated raw score infonnation can be used as an approximation. For other linking functions 
that are nonparallel-linear, linking only needs to be conducted for the total population. The 
difference of the standardized mean differences between each subpopulation and the total 
population across the old form and the new form can be used as an approximation of population 
invariance. Under the nonequivalent groups with anchor test design, conducting separate 
subpopulation linking and comparing them to the total population linking may still be the best 
way to estimate population invariance. 

Key words: Population invariance indices, single group linking, equivalent groups linking, 
NEAT linking, subpopulation linking 


1 



Table of Contents 


Page 

1. Dorans-Holland Measures of the Population Sensitivity of Score Linking Functions.2 

2. Parallel-Linear System of Linking Functions in the SG or EG Design—No Need to 

Conduct Any Linkings.5 

3. Equipercentile Linking in SG or EG Design—Conducting Linking Based on 

Total Population Only.6 

3.1 Results Based on Total Population Linking and Subpopulation Linkings—Full 

Equatability Analysis.7 

3.2 Results Based on Total Population Linking Only—Difference in Standardized 

Mean Differences Across Verbal and Critical Reading.8 

4. Sensitivity Indices in the NEAT Design.10 

4.1 Results Based on Full Equatability Analysis in an NEAT Design.11 

4.2 Results Based on Approximation: The Difference of the Standardized 

Mean Differences Between the Total Test and the Anchor Across 

the Old Form and the New Form.13 

5. Discussion.15 

References.18 


ii 













List of Tables 


Page 

Table 1. Summary Statistics of Full Equatability in an Equivalent Groups (EG) Design.8 

Table 2. Difference of the Standardized Mean Differences across Verbal and 

Critical Reading for Gender Groups.9 

Table 3. Sample Sizes for Equating New Form X to Old Form Tin a Nonequivalent 

Groups Anchor Test (NEAT) Design.11 

Table 4. Summary Statistics of Full Equatability in a Nonequivalent Groups Anchor 

Test (NEAT)Design.12 

Table 5. Raw Score Summary Statistics of Group Perfonnance in a Nonequivalent 

Groups Anchor Test (NEAT) Design.13 

Table 6. Difference of the Standardized Mean Differences Across the Total Test and 

the Anchor on the Old Fonn.14 

Table 7. Difference of the Standardized Mean Differences Across the Total Test and 

the Anchor on the New Form.14 

Table 8. Comparison of the Difference of the Standardized Mean Differences Between 

the Total Test and the Anchor Across the Old and the New Forms.15 


iii 











The goal of test score equating is to ensure that scores from one test fonn can be used 
interchangeably with scores from another form. Equating is the strongest form of test score 
linking. It requires strong assumptions: the same construct requirement, the equity requirement, 
the symmetry requirement, the equal (and high) reliability requirement, and the population 
invariance requirement (for more details, see Dorans & Holland, 2000; Holland & Dorans, 2006; 
Liu & Walker, 2007). In this paper, we focus on the last requirement; the population invariance 
requirement. 

The assumption of population invariance requires that the score equating function should 
be invariant across subpopulations from the total population from which the subpopulations are 
drawn. In other words, the equating function ought to be subpopulation independent. Kolen 
(2004) reviewed the research on population invariance, and concluded that population invariance 
holds approximately when alternate test forms are built to the same, or very similar, content and 
difficulty specifications. Equating should be population invariant, while other types of linking 
are not expected to be invariant (Holland & Dorans, 2006). 

The ETS research report titled Population Invariance of Score Linking: Theory and 
Application to Advanced Placement Program Examination (Dorans, 2003); the spring 2004 
special issue of Journal of Educational Measurement, titled Assessing the Population Sensitivity 
of Equating Functions (Dorans, 2004a); the 2004 special issue of Applied Psychological 
Measurement, titled Concordance (Pommerich & Dorans, 2004); and the 2008 special issue of 
Applied Psychological Measurement titled Population Invariance (von Davier & Liu, 2008 ); all 
contain collections of articles that study population sensitivity issues from different perspectives, 
across a variety of testing programs. For example, Yang (2004) examined whether the multiple- 
choice to composite linking functions in the Advanced Placement Program®, or AP®, exam 
remain invariant over subgroups defined by region. Yin, Brennan, and Kolen (2004) examined 
group invariance under concordance conditions between ACT and the Iowa Test of Educational 
Development (ITED) scores. Liu, Cahn, and Dorans (2006) examined population invariance of 
linking the revised SAT® to the old SAT, to assess the equatability of the revised SAT scores. 

The most commonly used population sensitivity indices were developed by Dorans and 
Holland (2000), where the total population is assumed to be partitioned into mutually exclusive 
and exhaustive subpopulations, and linking functions are conducted in the total population and in 
each subpopulation of interest. However, it can be very time consuming and computer-intensive 


1 



to conduct a separate linking function for each subpopulation (e.g., number of linkings = 
subpopulations x linking methods x measures.). For example, Dorans, Liu, Jiang, and Cahn 
(2006) conducted a score equity assessment (SEA; Dorans, 2004b), which produced estimates of 
means on the new SAT critical reading and math for gender groups and ethnic groups (White, 
Black, Asian American, Hispanic, and Other), assuming that the old SAT verbal and math had 
continued to be used. The number of subgroup linkings and scalings was 196 for critical reading 
and 210 for math, for a total number of 406. Dorans et al. (2006) assumed that the linking 
method chosen for the total population was appropriate for each of the subpopulations, which 
may not necessarily be true. If the researchers also tried multiple linking methods for each 
subpopulation, the total number of linkings would have been multiplied by 5 or 6. 

The goal of this study is to explore ways to approximate meaningful yet easily computed 
population invariance indices that do not require the creation of multiple subpopulation linking 
functions. This paper is organized in the following way. Section 1 reviews the Dorans-Holland 
measures of population sensitivity of score-linking functions. Section 2 discusses the parallel- 
linear system of linking functions in a single population, where there is no need to perform any 
actual linkings. Section 3 explores ways to approximate population invariance indices based on 
the total population linking function for the single group (SG) design or equivalent-groups (EG) 
design, when the linking functions are nonparallel-linear. Section 4 looks at the difference of the 
standardized mean differences between the total test and the anchor as an approximation for 
population invariance in the nonequivalent-groups anchor test (NEAT) design. Finally, section 5 
synthesizes these findings. 

1. Dorans-Holland Measures of the Population Sensitivity of Score-Linking Functions 

Dorans and Holland (2000) developed general population invariance indices of linking 
functions used for one population, either for a single group or for two groups that are equivalent, 
von Davier, Holland, and Thayer (2004) extended that work to the nonequivalent groups. 

Holland and Dorans (2006) synthesized the score-linking sensitivity indices across different 
linking designs. These methodological developments are pertinent to the present paper. 

Linking is usually conducted in the total group to produce a total group linking function 
and a total group scaling function that place raw scores onto the score reporting scale. To 
examine population invariance of linking functions, linkings and scalings are produced for each 
subpopulation of interest as well. The Dorans-Holland indices assume that the total population T 


2 



is partitioned into several subpopulations, 7} (j = 1,2, .. .)• X and Y are the two test forms to be 
linked. The linking on total population T is denoted by the linking function e T {x ), and e T (x) 

denotes the linking function for subpopulation 7}. Each subpopulation is weighted by its relative 
frequency, w,, so that X Wj = 1. The difference between e T (x) - e T (x) is then computed for 
each subpopulation. 

The first index is the root mean square difference measure, RMSD(x), defined as 


RMSD(x) = 


V 


X W J e T (x)-e r (x) 


< 7 , 


( 1 ) 


RMSD(x) provides an average across groups at each score level. Another index, root expected 
mean square difference (REMSD), provides a single number summarizing the values of 
RMSD(x). REMSD is obtained by averaging RMSD(x): 


REMSD = 



( 7 , 


YT 


( 2 ) 


where E r denotes expectation or average over the score distribution of X in T. 

In addition, we can also compute the root expected square difference for each 
subpopulation, RESD(/) to evaluate how close each subpopulation linking function is to the total 
population linking function: 


RESD(j) = 


e T (x) - e T (x) 


~l2 


<7 V 


( 3 ) 


RESD(/) weights by the relative frequency of new form X in the subpopulation 7}. There is a 
RESD(/) for each subpopulation. 

Note that the Dorans-Holland indices are based on the raw-to-raw linking and the divisor 
cTyj is used to quantity the differences in standard deviation units. However, we need to keep in 
mind that a raw-to-raw linking or an equating function is a transformation of raw scores on test 


3 




X, to the scale of raw scores on test Y. It is usually the first step of a two-step process by which 
raw scores on test X are put onto the reported scale on test Y. The second step is to convert the 
equated raw score of X to the reporting scale of Y, through a scaling function that maps the raw 
scores of Y to the scale. The first step of raw-to-raw equating function and the second step of 
scaling function are composed to convert the raw scores of X onto the reporting scale of Y 
(Holland & Dorans, 2006). The reported or the scaled scores are the final scores that test users 
get, and most readers are familiar with and can easily interpret scaled score values (e.g., the 
College Board 200-to-800 scale). 

Researchers have modified the standardized Dorans-Holland indices on the raw score 
scale and expressed the difference in the scaled score unit (Liu et al., 2006). The population 
invariance indices in the scaled score unit are then defined as: 


RMSD(x) = 



s T (x) - s r ( X ) 


REMSD = 



s T (x) - s T (x) 


and 


RESD(j) = J E 


“12 


s T (x) - s T (x) 


( 4 ) 

( 5 ) 

( 6 ) 


where s T (x) is the equating and scaling function or the raw-to-scale conversion based on 
subpopulation T ., and .sy(x) is the raw-to-scale conversion based on total population T. 

In order to evaluate the relative magnitude of the differences between subpopulation 
linking functions and the total population linking function, Dorans and Feigenbaum (1994) 
proposed the notion of the score difference that matters (DTM), in the context of SAT linking. 
On the SAT scale, scores are reported in 10-point units. For a given raw score, if the unrounded 
scaled scores resulting from two separate linkings differ by fewer than 5 points, then the scores 
should ideally be rounded to the same reported score. Dorans, Holland, Thayer, and Tateneni 
(2003) adapted the above indices used in SAT practice to other tests and considered the DTM to 
be half of a score unit for unrounded scores. 


4 



As can be seen from the formulae above, all of the calculations are based on total 
population linking and subpopulation linking functions. However, as we mentioned previously, 
performing multiple linkings can be very time and computer intensive. So the question remains: 
Is there a short cut that allows us to assess population invariance? 

2. Parallel-Linear System of Linking Functions in the SG or EG Design— 

No Need to Conduct Any Linkings 

Dorans and Holland (2000) examined RMSD(x) and REMSD for a special case, which 
they call the parallel-linear system of linking functions in the SG or EG design. The system of 
parallel-linear linking functions has the same slope between the subpopulation linking functions 
and the total population linking function. It only allows intercept differences between 
subgroup/total group linking functions. RMSD(x) and REMSD are equal for the parallel-linear 
system of linking functions: 


RMSD(x) = REMSD = 

where ju n , ju rr , /u XT , and denote the unequated raw score means of Y and X for 

subpopulation 7 and total population T, and cr )T and a XT denote the standard deviations of the 

unequated raw scores of Y and X for the total population. Therefore, as can be seen from the 
equation, we can estimate population sensitivity without conducting any linkings. 

Dorans and Holland (2000) illustrated the computation of RMSD(x) values for the 
parallel-linear case with several examples. They had two fonns, X and Y, two sets of scores, SAT 
verbal (SAT-V) and SAT math (SAT-M), on each form (the linkings were based on SAT-V to 
SAT-V and SAT-M to SAT-M), and three ways of forming subpopulations: gender, language 
spoken at home, and ethnicity. The results showed very little evidence of population dependence 
by the parallel linking functions. 

Liu and Holland (2008) also used this simplified version of RMSD(x) to explore the 
sensitivity of linking functions on the LSAT subpopulations defined by test-takers’ gender, 
ethnicity, geographic regions, whether they applied to law school, and their law school admission 
status. Population sensitivity was examined in three different linking situations: linking between 


£ 


w,. 


( f.Jyr . Ait ^ f Mxr, Mxt ^ 


CT v , 


<y v 


(V) 


5 



completely parallel tests, linking between tests that are not strictly parallel but are of comparable 
reliability, and linking between completely nonparallel tests. Results showed that linking parallel 
measures of equal reliability exhibits very little group dependence of linking functions across all 
the subpopulations studied, whereas the linkage of completely nonparallel tests shows substantial 
population dependence. Besides the main results of the study, it was shown that this simple 
version of RMSD(x) is a useful tool to assess population sensitivity, without carrying out the 
actual linkings. 

The beauty of this simplified formula for RMSD is that it is very easy to calculate. In 
reality, however, we often need to deal with situations that are more complicated than this. In the 
following section, we try to extend this simple version to a nonparallel-linear linking case: 
equipercentile linking. 


3. Equipercentile Linking in SG or EG Design— 

Conducting Linking Based on Total Population Only 

The equipercentile linking function is set so that the cumulative distribution function 
(CDF) of scores on form X converted to form Y scale is equal to the CDF of scores on form Y 
(Braun & Holland, 1982; Kolen & Brennan, 2004). This nonlinear transfonnation for total 
population T can be expressed as: 

y = Equiyj (x) = G y 1 [F r (x)] , (8) 

where F represents the CDF of X, G is the CDF of Y, and G 1 is the inverse of the CDF of Y. The 
intent is thatx and v have the same percentile in total population T. Similarly, for subpopulation 
7), the transformation equation is: 


v = Equi YT (x) = G r ' 



( 9 ) 


When we assume that the two CDFs, F T (x) and G T (v), have the same shape and only 
differ in their means and standard deviations, the equipercentile linking function becomes linear 
linking function, Lin YT (x) (Holland & Dorans, 2006), defined as: 


6 



(10) 


Liriyj (x) = /i }T H—— (x - /u XT ) . 

V XT 

Within the equivalent groups design, if the two forms can be equated, it is reasonable to 
assume that the means in the reported score scale should order various subpopulations in the 
same or a similar way across the new form and the old fonn (Holland & Dorans, 2006). In other 
words, the standardized mean difference for each subgroup should be identical or similar across 
the new and the old forms, 

li(ss) YTj - ju(ss) w _ juiss)^. -ju(ss)XT ^ 

cr ( 55 ) yr <j(ss) xt 

where ju(ss) and a(ss) are the mean and standard deviation of the scaled scores, respectively. 
As shown in Equation 11, the standardized mean difference is a type of effect size that quantifies 
the mean differences between two groups in standard deviation units. We just need to perform 
the linking based on the total population only, and then apply this total-population conversion to 
each subpopulation to get the summary statistics for each subpopulation. If the above equation 
does not hold for a particular group or groups, then this can serve as an indicator that the linking 
might be population dependent. 

In order to evaluate whether the above standardized mean difference can be used as an 
approximation of population invariance, and to explore the relationship between the standardized 
mean difference and the traditional RMSD(x) and REMSD indices, we examined empirical data 
from the spring 2003 new SAT field trial for illustration purpose. We first summarized the 
results based on subpopulation linking, which we call full equatability analysis. Then, we 
presented the results based on the total population linking, by examining the standardized mean 
differences between each gender subpopulation and the total population across the old version of 
the test (verbal section) and the new version of the test (critical reading section). 

3.1 Results Based on Total Population Linking and Subpopulation Linkings — 

Full Equatability Analysis 

In the 2003 new. SAT field trial, the booklets containing the new critical reading and the 
booklets containing the old verbal were spiraled, in an effort to yield equivalent groups. The 
resulting groups who took the new critical reading and the old verbal were deemed to be 


7 



equivalent (Liu et al., 2006). The critical reading section was then linked to verbal through the 
EG design for total population using equipercentile linking, and produced a total-group 
conversion. Equipercentile linking was performed for males and females as well, to yield a male- 
only conversion and a female-only conversion. We call this kind of analysis a full equatability 
analysis, since it involves both total population linking and subpopulation linkings. 

Table 1 presents the results of the full equatability analysis. For the 3,801 males who took 
the critical reading section, the results showed that they would have received a lower mean 
(474.9) if the male-only conversion (SGL in the table) had been used in place of the total group 
conversion (TGL in the table), which yielded a mean of 477.9. The mean difference was -3.4, 
with a standardized mean difference of -.03. For the 5,374 females, the full equatability analysis 
indicated that they would have obtained a higher mean (482.8) with a female-only conversion 
than with the total group conversion (480.4), with a mean difference of 2.3 and the standardized 
mean difference of .02. The RESD statistics are 3.7 and 2.7 for males and females, respectively. 
The REMSD value was around 3. These values are all below the DTM of 5, which suggests that 
the linkage of critical reading to verbal was essentially invariant across males and females. 

Table 1 


Summary Statistics of Full Equatability in an Equivalent Groups (EG) Design 


Group 

N 

Linking 

Mean 

SD 

Mean 

diff 

Std mean 
diff 

RESD 

Total 

9,194 

TGL 

479.4 

107.8 




Male 

3,801 

TGL 

477.9 

111.0 






SGL 

474.9 

110.0 

-3.4 

-.03 

3.7 

Female 

5,374 

TGL 

480.4 

105.3 






SGL 

482.8 

105.8 

2.3 

.02 

2.7 


Note. RESD = root expected square difference, SGL = subgroup linking, TGL = total group 
linking. 


3.2 Results Based on Total Population Linking Only — 

Difference in Standardized Mean Differences Across Verbal and Critical Reading 

The results in this section were based on total population linking only: We conducted 
total population linking, and then applied the conversion to males and females, to get the means 
and standard deviations for each group. We computed the difference in the standardized mean 


8 



differences across verbal and critical reading for each of the following comparisons: male versus 
total, female versus total, and male versus female. We then compared the results to those based 
on the full equatability analysis. 

Table 2 provides the means and standard deviations on the old verbal and on the new 
critical reading. The differences in standardized mean differences are presented as well. The data 
shows that on the old verbal the means were 474.8 and 479.3, for the male group and the total 
group, respectively. The standardized mean difference between the male group and the total 
group was -.04. This difference was based on 2,283 males out of 5,344 test-takers who took the 
verbal in the field trial. On the critical reading section, the standardized mean difference was -.01 
for the male minus the total group. This difference involved 3,801 males out of 9,194 test-takers 
who took the critical reading in the field trial. The difference in these two standardized mean 
differences was -.03 across the two tests. When compared to the equatability results described 
above, the two methods yielded identical values, -.03. 


Table 2 

Difference of the Standardized Mean Differences Across Verbal and Critical Reading for 


Gender Groups 




Verbal 


Critical reading 



N 

Mean 

SD 

N 

Mean 

SD 


Total 

5,344 

479.3 

107.9 

9,194 

479.4 

107.8 


Male 

2,283 

474.8 

110.4 

3,801 

477.9 

111.0 


Female 

3,055 

482.8 

105.8 

5,374 

480.4 

105.3 




Raw 

Std 


Raw 

Std 

Diff in std diff 



diff 

diff 


diff 

diff 

(Verbal - CR) 

M-T 


-4.5 

-0.04 


-1.4 

-0.01 

-0.03 

F-T 


3.4 

0.03 


1.1 

0.01 

0.02 

M-F 


-8.0 

-0.07 


-2.5 

-0.02 

-0.05 

Note. M 

- T = Male 

group - 

total group, F-T 

-= female 

group - 

total group, M - F = 


group - female group. 


For the female group, the standardized mean difference of female minus total was .03 on 
verbal and .01 on critical reading. The difference in the two standardized mean differences across 
the two tests was .02. This difference involved 3,055 females who took verbal and 5,374 females 
who took critical reading. Once again, this difference was identical to the results produced by the 
full equatability analysis. 


9 



The difference of the standardized mean differences between old verbal and new critical 
reading for the male minus female comparison was around .05 in absolute value. Compared to 
the REMSD value, which was around 3 in scaled score units and .027 in standard deviation units, 
the difference of the two standardized mean differences was about twice that of the REMSD 
value, considering rounding errors. It is reasonable in that when there are two subpopulations 
involved, the expected difference from the total population (-.03 for males and .02 for females) 
should be one-half the difference between the two subpopulations (-.05). 

The same pattern was also observed for the math results (Liu & Dorans, 2004). Hence, 
we may consider using means and standard deviations to estimate population invariance in the 
EG or SG design, without actually doing any subpopulation linkings. 

4 Sensitivity Indices in the NEAT Design 

In the NEAT design, population P takes form X and anchor A, and a different population 
Q takes form Y and the same anchor A When examinees with different abilities take different 
forms across different administrations in the NEAT design, it is more complicated to find a 
shortcut for assessing population sensitivity. However, the common items that are used to control 
examinee ability differences might be a place to start. In this paper, we only focus on chained 
linking with the NEAT design. 

Chained linking transforms scores through the following chained stages: First link X to A 
on population P; then link A to Y on population Q. These two linking functions are then 
composed to map A to Y through A. The first two stages are more like two SG linkings. Within 
each SG linking, it is reasonable to assume that the means should order various subpopulations in 
a same or similar way across the anchor and the total test. If population invariance holds across X 
and Y, it is also reasonable to assume that the means should order various subpopulations in a 
same or similar way across X and Y, and the anchor should order subpopulations in a same or 
similar way across the two populations. Hence, the mean differences between the total test and 
the anchor across the old and the new tests should be close. Any deviation could be a sign of 
subpopulation dependence. 

We can use the difference between the standardized mean differences of the total test and 
the anchor as an approximation. As shown in Equation 12, each component is actually an effect 
size, describing the differences in standard deviation units: 


10 



(12) 


MxPj Mxp Map, Map MyQj Myq Maq j Maq 


a 


XP 


u 


AP 


a. 


YQ 


a 


AQ 


where ji xp , /u xp , pi AP , and fi AP denote the raw score means ofX and A on subpopulation P, and 
population/ 5 , and a xp and <r AP denote the standard deviations ofXand A on P. Similarly, ju yQ , 
H YQ , jU AQ and /u AQ denote the raw score means of test Y and anchor A on subpopulation Q, and 
population Q; and a YQ , and a AQ denote the standard deviations of Y and A on Q. 

Again, we examine our hypothesis by comparing the full equatability analyses results, 
which were based on total population and subpopulation linkings, to the results based on the 
approximation using standardized mean differences. 


4.1 Results Based on Full Equatability Analysis in a NEAT Design 

Form X was a new SAT critical reading section, and Form Y was an old SAT-V section. 
Forms Xand Y were administered operationally in different SAT administrations. Form A was 
linked to Form Y, through an external anchor for the total population and each of the ethnic 
subpopulations. Table 3 contains sample sizes for the total group and ethnic subgroups. Note that 
these were the linking samples used when the test was equated, while the samples contained in 
Table 4 were obtained after equating, and were used to project summary statistics. The White 
group had relative large sample sizes, whereas other ethnic groups had much smaller sample 
sizes. The chosen equating function was the chained equipercentile equating using log-linear 
presmoothed data, for the total group and for each subgroup. 

Table 3 


Sample Sizes for Equating New Form X to Old Form Y in a Nonequivalent Groups Anchor 
Test (NEAT) Design 



New form 

Old form 

Total 

6,351 

15,746 

White 

3,928 

9,096 

Black 

444 

1,686 

Hispanic 

520 

1,215 

Asian American 

696 

1,405 

Other 

763 

2,344 


11 



Table 4 summarizes the results based on the total group linking (TGL) and the subgroup 
linking (SGL), including the difference of the means based on the total group linking and the 
subgroup linking (the mean diff), and the RESD statistics. 


Table 4 

Summary Statistics of Full Equatability in a Nonequivalent Groups Anchor Test (NEAT) 


Design 


Group 

N 

Linking 

Mean 

SD 

Mean 

diff 

Std 

mean 

diff 

RESD 

Total 

271,751 

TGL 

526.3 

110.0 




Asian American 

32,385 

TGL 

542.9 

113.9 






SGL 

537.1 

116.5 

-5.8 

-.05 

8.7 

White 

166,043 

TGL 

539.4 

101.1 






SGL 

539.8 

99.9 

0.4 

.00 

2.1 

Other 

31,202 

TGL 

536.1 

120.7 






SGL 

537.9 

121.2 

1.8 

.02 

3.6 

Hispanic 

21,617 

TGL 

473.7 

104.3 






SGL 

478.6 

106.2 

5.0 

.05 

6.4 

Black 

20,504 

TGL 

434.1 

100.4 






SGL 

433.7 

96.7 

-0.4 

-.00 

4.8 


Note. RESD = root expected square difference, SGL = subgroup linking, TGL = total group 


linking. 


The results indicate that the Asian American group would have received a lower mean 
(537.1) if the Asian American-only conversion had been used in place of the total group 
conversion, which produced a mean of 542.9, with a difference of 5.8 points. Similarly, the 
Black group would also have had a lower mean (433.7), if the Black-only conversion had been 
used. For the White, Other, and Hispanic groups, the subgroup-only conversions would have 
produced higher means than the total group conversion, with the mean differences being 
positive. The White and Black groups had the smallest mean differences, 0.4 in absolute value. 
For other subgroups, the mean differences range from 1.8 to 5.8 in absolute value. The biggest 
mean difference was found in the Asian American group (-5.8), followed by the Hispanic group 
(5.0). The RESD statistics concur with the mean differences as expected, in that the Asian 
American and Hispanic groups had the biggest RESD values: 8.7 for Asian American and 6.4 for 


12 



Hispanic. The differences for the Asian American and Hispanic groups were considered large 
enough (exceeding the DTM) to exhibit group dependence. 

In summary, the White group did not exhibit population sensitivity, whereas the Asian 
American and Hispanic groups exhibited large differences between the subgroup linking and the 
total group linking, to a degree that merits investigation. 

4.2 Results Based on Approximation: The Difference of the Standardized Mean Differences 
Between the Total Test and the Anchor Across the Old Form and the New Form 

This section examines the difference of the standardized mean differences between the 
total test and the anchor across the old form and the new form, as an approximation. Table 5 
contains the raw score summary statistics of population P taking form A and anchor^, and 
population Q taking form Y and anchor^, broken down by group membership. 


Table 5 


Raw Score Summary Statistics of Group Performance in a Nonequivalent Groups Anchor 
Test (NEAT) Design 


Group 

Old form 

New form 


Total test 

Anchor 

Total test 

Anchor 

Test length 

78 

19 

67 

19 

Total group - mean 

37.41 

9.21 

34.36 

9.73 

-SD 

18.17 

4.98 

15.56 

4.92 

Asian American 

37.54 

9.62 

36.47 

10.56 


19.56 

5.36 

16.13 

5.21 

Black 

22.05 

5.36 

21.17 

5.91 


14.88 

4.27 

15.08 

4.67 

Hispanic 

29.89 

6.91 

27.09 

7.42 


16.52 

4.64 

14.97 

4.83 

White 

40.67 

10.06 

36.08 

10.23 


16.42 

4.52 

14.40 

4.57 

Other 

39.60 

9.66 

36.20 

10.24 


19.94 

5.41 

16.30 

5.15 


13 



First, we calculated the standardized mean difference for each pair of subgroup minus 
total group on the total test and on the anchor on the old form Y. Table 6 lists the results. For 
example, the standardized mean difference between the Asian American group and the total 
group was .01 on the total test, and .08 on the anchor. The difference was -.07. Relatively 
speaking, the Asian American group did a little worse on the total test than on the anchor. So did 
the Black group, also with a difference of -.07. The White group did about the same on the total 
test and on the anchor. The Other group and the Hispanic group did a little better on the total test 
than on the anchor. 


Table 6 

Difference of the Standardized Mean Differences Across the Total Test and the Anchor 
on the Old Form 


Group 

Total 

Old form 

Anchor 

Total - anchor 

Asian American 

0.01 

0.08 

-0.07 

White 

0.18 

0.17 

0.01 

Other 

0.12 

0.09 

0.03 

Hispanic 

-0.41 

-0.46 

0.05 

Black 

-0.84 

-0.77 

-0.07 


Second, we got the standardized mean difference on the total test and on the anchor for 
each subpopulation on the new formX. The results are summarized in Table 7. Again, the Asian 
American group and the Black group did a little worse on the total test than on the anchor. But 
the Hispanic group did just about the same on the total test as on the anchor. 


Table 7 

Difference of the Standardized Mean Differences across the Total Test and the Anchor 
on the New Form 


Group 

Total 

New form 

Anchor 

Total - anchor 

Asian American 

0.14 

0.17 

-0.03 

White 

0.11 

0.10 

0.01 

Other 

0.12 

0.10 

0.02 

Hispanic 

-0.47 

-0.47 

0.00 

Black 

-0.85 

-0.78 

-0.07 


14 



Third, we compared the (total minus anchor) difference across the old and the new fonns. 
As can be seen from Table 8, the difference was -.04 for the Asian American group, and .05 for 
the Hispanic group. We also put the full equatability analysis results in Table 8, for the purpose 
of comparison. As we can see, the results based on the two methodologies are quite similar. 

Table 8 


Comparison of the Difference of the Standardized Mean Differences Between the Total Test 
and the Anchor Across the Old and the New Forms 


Group 

Diffof 

(total - anchor) 

Old form New form 

Std. mean diff of (total - 
anchor) across the old 
and the new forms 

Std. mean diff of (SGL - 
TGL) based on full 
equatability analysis 

Asian American 

-0.07 

-0.03 

-0.04 

-0.05 

White 

0.01 

0.01 

0.00 

0.00 

Other 

0.03 

0.02 

0.02 

0.02 

Hispanic 

0.05 

0.00 

0.05 

0.05 

Black 

-0.07 

-0.07 

0.00 

-0.00 


Note. SGL = subgroup linking, TGL = total group linking. 


However, at present, there is some disagreement about using this method. It is argued that 
P and Q are two different populations; hence they are not directly comparable (N. Dorans, 
personal communication, April 23, 2007). It is argued that this method neglects the possible 
interactions between the group membership and the test difficulty. Even if the difference of total 
minus anchor standardized mean differences across the old and the new forms is zero for a 
particular subgroup, it just means that this particular group finds the anchor test being similar to 
the total test at the difficulty level, in both the old form and the new form, but it does not reveal 
the relationship between the group membership and the form difficulty across the new form and 
the old form. 


5. Discussion 

The purpose of this paper was to explore methods in identifying population invariance, 
without conducting multiple linkings for subpopulations. Under the SG or EG design, no linking 
needs to be performed for the parallel-linear system linking functions. The RMSD(x) is equal to 
the REMSD value that can be calculated using unequated raw score information. For other 
linking functions that are nonparallel-linear, linkings only need to be conducted for the total 
population. The total population conversion can then be applied to different subpopulations, and 


15 



the difference of the standardized mean differences between each pairing of subpopulation and 
the total population across the old form and the new form can be used as an approximation of the 
full equatability population invariance indices. However, we would like to point out that the 
RMSD(x) statistics quantify weighted differences between subgroup versus total group linking 
functions at each score level, whereas the approach of standardized differences only take into 
account the means and standardized deviations of equated scores, ignoring the relative 
frequencies across score levels. Hence, small standardized mean differences cannot warrant 
population invariance at score levels. 

It is more complicated with the NEAT design when it involves two different populations. 
The difference of standardized mean differences between the total and the anchor test across the 
old and the new forms might be useful, but there is debate about using it. The results here were 
only based on one data set. More evidence needs to be collected. In addition, we basically used 
chained linear linking function, which may not be appropriate to expand to other linking 
situations where the relationship is not linear. 

This paper does not explore alternative ways to calculate population invariance indices 
with chained curvilinear linking and post stratification linking. These might be topics for future 
research. For example, it may be possible that we can break down the chained linking into 2 SG 
linkings, conduct chained curvilinear linking within each SG, and evaluate population invariance 
in each SG linking, using the standardized mean difference based on total group linking. 

In the case of post stratification equating (PSE), such as Tucker linear equating, we can 
first perform regression of X on A in total population P and in different subpopulations, to get a 
regression slope and a regression intercept for each subpopulation. We also need to calculate the 
conditional variance of X given A, in population P and for each subpopulation. If the slopes, 
intercepts, and conditional variances are invariant across subgroups, then it is likely that the 
conditional distribution of X given A is population invariant within population P. A similar set of 
analyses would need to be done within population Q, to determine whether the conditional 
distribution of Y given A is population invariant. If population invariance is satisfied in both 
populations, then population invariance is going to hold in the synthetic population, given the 
assumptions of PSE. However, in this case, the amount of actual work is not reduced. Instead, it 
gets increased. If we perform a regression analysis for each subgroup in populations P and Q, it 
is reasonable that we might want to go ahead and conduct the equating for each subgroup. 


16 



Essentially, the approximation methods of using standardized mean differences proposed 
in this study are pretty much based on the assumptions of linear equating or linear linking. It 
seems paradoxical, though, to evaluate population invariance of nonlinear linking functions 
using such linearity-based statistics. Therefore, we suggest using the standardized mean 
difference only as an approximation of population invariance in the SG or EG design. Under the 
NEAT design, conducting individual subpopulation linkings and comparing them to the total 
population linking is probably still the best way to determine population invariance. 


17 



References 

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of 
some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating 
(pp. 9-49). New York: Academic Press. 

Dorans, N. J. (Ed.). (2003). Population invariance of score Unking: Theory and applications to 
Advanced Placement Program examinations (ETS Research Rep. No. RR-03-27). 
Princeton, NJ: ETS. 

Dorans, N. J. (2004a). Assessing the population sensitivity of equating functions (Special issue). 
Journal of Educational Measurement, 41(1). 

Dorans, N. J. (2004b). Using subpopulation invariance to assess test score equity. Journal of 
Educational Measurement, 41(1), 43-68. 

Dorans, N. J., & Feigenbaum, M. D. (1994). Equating issues engendered by changes to the SAT 
and PSAT/NMSQT®. In I. M. Lawrence, N. J. Dorans, M. D. Feigenbaum, N. J. Feryok, 
A. P. Schmitt, & N. K. Wright (Eds.), Technical issues related to the introduction of the 
new SAT and PSAT/NMSQT (ETS Research Memorandum No. RM-94-10). Princeton, 
NJ: ETS. 

Dorans, N. J., & Holland, P. W. (2000). Population invariance and the equatability of tests: Basic 
theory and the linear case. Journal of Educational Measurement, 37(A), 281-306. 

Dorans, N. J., Holland, P.W., Thayer, D. T., & Tateneni, K. (2003). Invariance of score linking 
across gender groups for three Advanced Placement program examinations. In N. J. 
Dorans (Ed.), Population invariance of score linking: Theory and applications to 
Advanced Placement Program examinations (ETS Research Rep. No. RR-03-27, pp. 79- 
118). Princeton, NJ: ETS. 

Dorans, N. J., Liu, J., Cahn, M., & Jiang, Y. (2006). Score equity assessment of transition from 
SAT I Verbal to SAT Critical Reading: Gender (ETS Statistical Rep. No. SR-06-61). 
Princeton, NJ: ETS. 

Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), 
Educational measurement (4th ed., pp. 187-220). Westport, CT: Prager. 

Kolen, M. J. (2004). Population invariance in equating and linking: Concepts and history. 
Journal of Educational Measurement, 41(1), 3-14. 


18 



Kolen, M. J., & Brennan, R. L. (2004). Test equating, linking, and scaling: Methods and 
practices (2nd ed.). New York: Springer-Verlag. 

Liu, J., Cahn, M., & Dorans, N. J. (2006). An application of score equity assessment: Invariance 
of linking of new SAT to old SAT across gender groups. Journal of Educational 
Measurement, 43(2), 113-129. 

Liu, J., & Dorans, N. J. (2004). Projected changes in ethnic and gender group performance: An 
approximate assessment for the new SAT (ETS Statistical Rep. No. 2004-23). Princeton, 
NJ: ETS. 

Liu, J., & Walker, M. E. (2007). Score linking issues related to test content changes. In N. J. 

Dorans, M. Pommerich, & P. Holland (Eds.), Linking and aligning scores and scales (pp. 
109-134). New York: Springer-Verlag. 

Liu, M., & Holland, P.W. (2008). Exploring the population sensitivity of linking functions across 
three law school admission test administrations. Applied Psychological Measurement, 
32(1), 27-44. 

Pommerich, M., & Dorans, N. J. (Eds.). (2004). Concordance [Special issue]. Applied 
Psychological Measurement, 28(4). 

von Davier, A. A., & Liu, M. (Eds.). (2006). Population invariance of testing equating and 

linking: Theory extension and applications across exams (ETS Research Rep. No. RR- 
06-31). Princeton, NJ: ETS. 

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The chain and post-stratification 

methods of observed-score equating: Their relationship to population invariance. Journal 
of Educational Measurement, 41(1), 15-32. 

Yang, W. L. (2004). Sensitivity of linkings between AP multiple choice scores and composite 
scores to geographical region: An illustration of checking for population invariance. 
Journal of Educational Measurement, 41(1), 33—41. 

Yin, P., Brennan, R. L., & Kolen, M. J. (2004). Concordance between ACT and ITED scores 
from different populations. Applied Psychological Measurement, 28(4), 273-289. 


19