Page 369 - Invited Paper Session (IPS) - Volume 2

IPS279 Rense Lange
            three-category rating scale (e.g., “insufficient,” “intermediate,” and
            “sufficient”) with step values Fw = {-1.1, 1.1}. There are a total of five graders
            available to grade the essay answers of 10,000 test takers.
                The graders’ severity parameters Sk were drawn from N(0, 0.25), which
            agrees with values obtained in actual large-scale grading applications [6]. The
            reading abilities Tj of the 10,000 students were drawn from N(0.25, 1.1). Half of
            the students were graded twice, the others once. Graders were assigned to
            students at random, subject to the constraint that all graders occurred equally
            often. While all “students” contribute to the item and step calibrations, the
            grader severity estimates depend solely on the students who were graded twice.
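
            The simulation design described above can be sketched in Python as follows.
            This is a minimal illustration, not the author's code. It assumes that the
            second argument of N(., .) is a standard deviation, that a single item with
            difficulty 0 is graded, that ratings follow a rating-scale model in which
            the log-odds of category x versus x-1 equal T - D - S - F_x, and that
            twice-graded students receive two different graders. The function name
            simulate_ratings and all of its parameters are hypothetical.

```python
import numpy as np

def simulate_ratings(n_students=10_000, n_graders=5, steps=(-1.1, 1.1),
                     item_difficulty=0.0, seed=0):
    """Simulate one data set under the design sketched in the text."""
    rng = np.random.default_rng(seed)
    severity = rng.normal(0.0, 0.25, n_graders)   # S_k (0.25 taken as an SD)
    ability = rng.normal(0.25, 1.1, n_students)   # T_j (1.1 taken as an SD)

    # Half of the students are graded twice, the others once.
    twice_ids = rng.choice(n_students, n_students // 2, replace=False)
    once_ids = np.setdiff1d(np.arange(n_students), twice_ids)

    # Balanced random assignment: cycling through the graders over a random
    # permutation makes each grader occur equally often (exactly so when the
    # group sizes divide evenly, as with the defaults).
    g_once = rng.permutation(once_ids.size) % n_graders
    idx = rng.permutation(twice_ids.size)
    g_first = idx % n_graders
    offset = 1 + (idx // n_graders) % (n_graders - 1)   # 1..n_graders-1
    g_second = (g_first + offset) % n_graders           # never equals g_first

    student = np.concatenate([once_ids, twice_ids, twice_ids])
    grader = np.concatenate([g_once, g_first, g_second])

    # Category probabilities for ratings 0..K under the assumed model.
    eta = ability[student] - item_difficulty - severity[grader]
    f = np.asarray(steps, dtype=float)
    psi = np.concatenate(
        [np.zeros((eta.size, 1)), np.cumsum(eta[:, None] - f[None, :], axis=1)],
        axis=1,
    )
    prob = np.exp(psi - psi.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)

    # Inverse-CDF sampling; dropping the last cumulative column keeps every
    # rating within 0..K even under floating-point rounding.
    u = rng.random(eta.size)
    rating = (u[:, None] > np.cumsum(prob, axis=1)[:, :-1]).sum(axis=1)

    return {"student": student, "grader": grader, "rating": rating,
            "severity": severity, "ability": ability}
```

            Calling simulate_ratings(seed=1) returns 15,000 gradings (5,000 students
            graded once and 5,000 graded twice), with each of the five graders
            occurring 3,000 times.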
               Parameter Recovery. As was the case for binary items, the difficulty
            parameters and step values could be recovered admirably, with a Root
            Mean Square Error (RMSE) of 0.04, even after just 100 students had
            been graded. Items’ step values likewise were recovered accurately (RMSE =
            0.02). Although this work has not yet been completed, the accurate recovery
            of D and F suggests that the JMLE recovery of the person parameters T will
            be accurate as well. For efficiency, estimating T should be combined with
            the computation of diagnostic and quality-control information.

            Figure 5: Simulated recovery of grader severity parameters across number of students
            graded twice.



















            The main interest here is the recovery of the graders’ severity parameters, as
            these require the greatest computational effort. Figure 5 shows the average RMSE
            (Y-axis) of the Sk estimates as a function of the number N of twice-graded
            students (X-axis). The vertical bars reflect the standard deviation (SD) of
            these estimates over 100 independent repetitions with different samples of
            10,000 “students.”
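
            The summary plotted in such a figure can be computed as sketched below.
            This is an illustration under stated assumptions, not the author's
            procedure: it assumes that the RMSE is taken over the graders within each
            repetition, that the bars show the SD of these per-repetition RMSE values,
            and that the SD uses the n-1 denominator. The names rmse and
            summarize_replications are hypothetical.

```python
import numpy as np

def rmse(estimates, truth):
    """Root mean square error between estimated and true parameters."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))

def summarize_replications(estimates_by_rep, truth_by_rep):
    """Mean and SD (n-1 denominator, an assumption) of per-repetition RMSE.

    Each argument is a sequence with one entry per repetition; every entry
    holds the severity estimates (or true severities) of all graders.
    """
    values = np.array([rmse(e, t)
                       for e, t in zip(estimates_by_rep, truth_by_rep)])
    return float(values.mean()), float(values.std(ddof=1))
```

            Applying summarize_replications once per value of N yields one point and
            one vertical bar of the curve.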
                It can be seen that the RMSE of the Sk estimates decreases (and the
            correlation increases) rapidly. This trend slows after N = 300, with the RMSE
            eventually tending to zero (and the correlation to one). Very similar patterns
            obtain for other realistic parameter settings. The present approach was highly
            efficient; e.g., given the updated frequency matrices, the


                                                               356 | ISI WSC 2019