Page 367 - Invited Paper Session (IPS) - Volume 2

IPS279 Rense Lange
            formal proof or derivations. Those interested in a more complete presentation
            should consult [8].
              Rating scales are modelled analogously to the binary case. Figure 3 shows
               that frequencies are collected in a matrix F with sides of length
               nitems × nsteps1, where nitems is the number of items, nsteps denotes the
               number of answer categories, and nsteps1 = nsteps – 1. Higher ratings
               (i.e., 1, …) occur along the rows, whereas lower values occur along the
               columns (…, nsteps1). Each new rating has to be compared to all others,
               requiring a total of nitems(nitems – 1) / 2 comparisons. Realistic
               applications rarely involve more than five rating scales, which requires at
               most 10 numerical comparisons. As an example, assume that the very first
               observations to be added to the zero matrix F are 0, 2, and 1 for items 1, 2,
               and 3, respectively. In this case the item table is updated as shown by the
               1's in the left table. Of course, later observations will be added
               cumulatively as grading progresses.
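
A minimal sketch of enumerating the pairwise comparisons for a single observation, using the example ratings 0, 2, and 1 for items 1, 2, and 3. The function name `item_pairs` and the dictionary layout are illustrative assumptions; how each comparison maps to a cell of F follows Figure 3 and is not reproduced here.

```python
from itertools import combinations


def item_pairs(ratings):
    """Return every unordered pair of items together with their ratings.

    `ratings` maps item number -> observed rating (hypothetical layout).
    """
    return [((i, ratings[i]), (j, ratings[j]))
            for i, j in combinations(sorted(ratings), 2)]


# Example from the text: ratings 0, 2, 1 for items 1, 2, 3.
observation = {1: 0, 2: 2, 3: 1}
pairs = item_pairs(observation)

# The number of pairs equals nitems(nitems - 1) / 2.
nitems = len(observation)
assert len(pairs) == nitems * (nitems - 1) // 2
```

With three items this yields three comparisons; with five items it would yield ten.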
              Computing the rater severity parameters requires updating a second F
               matrix of size nrat × nrat (see Figure 3, right side), where nrat denotes the
               total number of raters, and the focus is on the sum of the ratings. In
               particular, whenever the same person is rated by a second (or 3rd, 4th, …)
               rater, the sum of the latest ratings is compared to the earlier summed
               ratings. The table then simply tallies, by rater, the number of times one of
               the sums exceeds the other by exactly one point.
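
The tallying rule above might be sketched as follows. This is an illustration under stated assumptions rather than the author's implementation: the names `make_state` and `add_rating_sum` are hypothetical, raters are indexed from 0, and the row is assumed to be the rater whose sum is higher (Figure 3 may orient the table differently).

```python
from collections import defaultdict


def make_state(nrat):
    """Zero nrat x nrat tally matrix plus a per-person history of sums."""
    return {"F": [[0] * nrat for _ in range(nrat)],
            "seen": defaultdict(list)}


def add_rating_sum(state, person, rater, total):
    """Compare a new rating sum with earlier sums for the same person."""
    F = state["F"]
    for earlier_rater, earlier_total in state["seen"][person]:
        diff = total - earlier_total
        if diff == 1:
            # Assumption: row = rater with the higher sum.
            F[rater][earlier_rater] += 1
        elif diff == -1:
            F[earlier_rater][rater] += 1
        # Differences other than exactly one point are not tallied.
    state["seen"][person].append((rater, total))


state = make_state(3)
add_rating_sum(state, "p1", 0, 7)
add_rating_sum(state, "p1", 1, 8)   # exceeds rater 0's sum by one
add_rating_sum(state, "p2", 0, 5)
add_rating_sum(state, "p2", 2, 5)   # equal sums, no tally
add_rating_sum(state, "p1", 2, 6)   # one below rater 0, two below rater 1
```

Keying the history by person means a new rating sum is compared only with earlier sums for that same person, without scanning other test-takers.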
              Item biases can be studied by computing group-specific rating × categories
               F matrices (see below).
               Computational Effort During Data Collection. Because the number of
            test takers can be very large, updating the rater table is far more
            computationally expensive overall than updating the item table. That is,
            one has to keep track of doubly rated students, and their rating sums have
            to be compared across raters. In the worst case, each new case may require
            inspecting all previously processed test takers. Assuming that npers students
            have already been graded, this could potentially require npers(npers – 1)/2
            comparisons. The use of random access techniques can reduce this number.
            Given the frequency matrices, the following steps are required to obtain actual
            parameter estimates:
                  The ratios of the off-diagonal elements are required. For instance, the rater
                    matrix F in Figure 3 yields the ratios Rij = Fji / Fij shown in the centre
                    table of Figure 4. See below for dealing with zeros.
                  Taking the logarithms of each entry in R yields the matrix log(R) shown
                    on the right of Figure 4.
                  The  row-means  of  log(R)  represent  the  severity  parameters  Sk
                    (rightmost Sk column of Figure 4).
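
The three estimation steps above can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `severity_from_counts` and the example counts are made up (they are not the values of Figures 3 and 4), the ratio is assumed to be oriented as R[i][j] = F[j][i] / F[i][j], the diagonal is assumed to contribute log(1) = 0 to each row mean, and zero counts are simply rejected rather than handled.

```python
import math


def severity_from_counts(F):
    """Row means of log(R), where R[i][j] = F[j][i] / F[i][j] (assumed)."""
    n = len(F)
    log_r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: diagonal treated as log(1) = 0
            if F[i][j] == 0 or F[j][i] == 0:
                # Zero counts need separate treatment, not covered here.
                raise ValueError("zero off-diagonal count")
            log_r[i][j] = math.log(F[j][i] / F[i][j])
    # Assumption: the row mean divides by n, including the diagonal zero.
    return [sum(row) / n for row in log_r]


# Hypothetical 3 x 3 rater tally matrix.
F_example = [[0, 2, 4],
             [4, 0, 2],
             [1, 4, 0]]
S = severity_from_counts(F_example)
```

Because log(R) is antisymmetric under these assumptions, the resulting parameters sum to zero.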
                  Updating  the  D  and  S  frequency  matrices  is  computationally
                    independent. Thus, these updates can be performed in parallel using

                                                               354 | I S I   W S C   2 0 1 9