A Comparative Study of Interrater Reliability Coefficients Obtained from Different Statistical Procedures Using Monte Carlo Simulation Techniques

Ebrima Nying, Western Michigan University

Abstract

Reliability estimation is a key research component within the broad field of educational assessment. The literature reports numerous studies using different statistical techniques for estimating the reliability of educational measures; however, few have focused on estimating the interrater reliability of performance assessments (Abedi, Baker, & Herl, 1995). Specifically, this study compared three methods for estimating interrater reliability to determine whether the estimates differ as a function of sample size, measurement scale, number of raters, and the theoretical population reliability (rho). The three methods of estimation were the Intraclass Correlation ICC(2, k) (Shrout & Fleiss, 1979), Kendall's Coefficient of Concordance (Kendall, 1938), and Kappa for Multiple Raters (Fleiss, 1971).
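For reference, the three coefficients take the following standard forms, with notation adapted here for illustration: n subjects rated by k raters; BMS, JMS, and EMS are the between-subjects, between-judges, and residual mean squares from a two-way ANOVA; S is the sum of squared deviations of the rank totals from their mean; and \bar{P} and \bar{P}_e are the mean observed and chance-expected agreement proportions:

\[
\mathrm{ICC}(2,k) = \frac{BMS - EMS}{BMS + (JMS - EMS)/n}, \qquad
W = \frac{12S}{k^{2}\,(n^{3} - n)}, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
\]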

This study employed a 4-between (three measurement scales, four sample sizes, four rater groups, and two population rhos), 1-within (three estimation methods) mixed design. ANOVA results indicated two significant 4-way interactions: Measure × Scale × Rater × Rho and Measure × Scale × Size × Rho. Simple-effects analyses of the 4-way interactions focused on each level of rho, and parallel 3-way ANOVAs (Measure × Scale × Size, or Measure × Scale × Rater) were conducted. Post hoc analyses indicated that both 3-way interactions were significant. To further understand these 3-way interactions, they were followed by 2-way analyses holding Measure constant.
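As an illustration only (the study's actual code is not reproduced here), the following minimal Python sketch generates one cell of such a design under a classical true-score model in which the population single-rater reliability equals rho, then computes ICC(2, k). All function and variable names (simulate_ratings, icc_2k, n_subjects, n_raters) are ours, not the author's:

    import numpy as np

    def simulate_ratings(n_subjects, n_raters, rho, rng):
        """Draw continuous ratings X_ij = T_i + E_ij with var(T) = rho and
        var(E) = 1 - rho, so the population single-rater reliability is rho."""
        true_scores = rng.normal(0.0, np.sqrt(rho), size=(n_subjects, 1))
        errors = rng.normal(0.0, np.sqrt(1.0 - rho), size=(n_subjects, n_raters))
        return true_scores + errors

    def icc_2k(x):
        """ICC(2, k) from the two-way ANOVA mean squares (Shrout & Fleiss, 1979)."""
        n, k = x.shape
        grand = x.mean()
        row_means = x.mean(axis=1, keepdims=True)   # subject means
        col_means = x.mean(axis=0, keepdims=True)   # rater means
        bms = k * ((row_means - grand) ** 2).sum() / (n - 1)
        jms = n * ((col_means - grand) ** 2).sum() / (k - 1)
        ems = ((x - row_means - col_means + grand) ** 2).sum() / ((n - 1) * (k - 1))
        return (bms - ems) / (bms + (jms - ems) / n)

    rng = np.random.default_rng(42)
    estimates = [icc_2k(simulate_ratings(50, 4, 0.8, rng)) for _ in range(1000)]
    # Note: ICC(2, k) estimates the reliability of the k-rater mean, which under
    # this model is the Spearman-Brown value k*rho / (1 + (k - 1)*rho), not rho.
    print(np.mean(estimates))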

Finally, Tukey HSD tests were examined. This post hoc strategy consistently showed mean differences in the reliability estimates among the different measurement scales as a function of both number of raters and sample size, regardless of rho.

Recommendations for future research and limitations of this study are provided. Although population rho had a predictable effect, the present findings should be replicated and extended with other values of population rho. In addition, because this study considered only normally distributed data, the influence of the population distribution should be examined, especially with response scales having fewer levels.

Practitioners and researchers who are faced with estimating interrater reliability should be comforted by the striking stability of these methods. When the data are dichotomous, Kappa for Multiple Raters is recommended; when the data are ordinal, Kendall's Coefficient of Concordance is recommended.
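For concreteness, minimal Python sketches of the two recommended coefficients are given below, following the standard formulas for these statistics; the function names and example data are ours and purely illustrative, and the Kendall's W version shown omits the correction for tied ranks:

    import numpy as np

    def kendalls_w(ranks):
        """Kendall's W from an (n subjects x k raters) matrix of ranks,
        where each column is one rater's ranking of the n subjects."""
        n, k = ranks.shape
        rank_sums = ranks.sum(axis=1)
        s = ((rank_sums - rank_sums.mean()) ** 2).sum()
        return 12.0 * s / (k ** 2 * (n ** 3 - n))

    def fleiss_kappa(counts):
        """Fleiss' kappa from an (n subjects x c categories) count matrix,
        where each row sums to the number of raters."""
        n, c = counts.shape
        k = counts[0].sum()                     # raters per subject
        p_j = counts.sum(axis=0) / (n * k)      # marginal category proportions
        p_i = ((counts ** 2).sum(axis=1) - k) / (k * (k - 1))
        p_e = (p_j ** 2).sum()                  # chance-expected agreement
        return (p_i.mean() - p_e) / (1.0 - p_e)

    # Ordinal data: 3 raters each rank 4 subjects.
    ranks = np.array([[1, 1, 2], [2, 3, 1], [3, 2, 4], [4, 4, 3]])
    print(kendalls_w(ranks))

    # Dichotomous data: 4 raters classify 4 subjects into 2 categories.
    counts = np.array([[4, 0], [2, 2], [0, 4], [3, 1]])
    print(fleiss_kappa(counts))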