Automated scoring (AS) of constructed-response items has become increasingly common in K–12 formative, interim, and summative assessment programs. AS has been shown to perform well in essay writing, reading comprehension, and mathematics. However, less is known about how automated scoring engines perform for K–12 student subgroups defined by gender, race/ethnicity, English proficiency status, disability status, and economic status. In addition, measures for detecting bias are not well defined.
This study examined the performance of Cambium Assessment, Inc.’s automated scoring engine, Autoscore, on 24 reading comprehension items from grades 3–8 and 11. Bias was examined across these subgroups using both the full data set and a subset matched on ability and subgroup covariates via propensity score matching; the purpose of matching was to examine bias while controlling for examinee ability. In addition to the usual agreement metrics of quadratic weighted kappa (QWK), exact agreement, and standardized mean difference (SMD), the study used agreement matrices to examine possible bias in the patterns of rubric scores assigned by human raters and by the automated scoring engine.
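The three agreement metrics named above can be sketched in plain Python. This is an illustrative implementation, not the study's code; the sample scores are hypothetical, and the SMD is computed here with a pooled standard deviation, a convention that varies across assessment programs.

```python
# Illustrative sketch of the agreement metrics: QWK, exact agreement,
# and standardized mean difference for two raters' integer rubric scores.
import math
from collections import Counter

def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK: 1 - (weighted observed disagreement / weighted chance disagreement)."""
    n_cats = max_score - min_score + 1
    n = len(a)
    # Observed joint proportion matrix.
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for x, y in zip(a, b):
        obs[x - min_score][y - min_score] += 1.0 / n
    pa, pb = Counter(a), Counter(b)  # marginal counts per rater
    num = den = 0.0
    for i in range(n_cats):
        for j in range(n_cats):
            w = (i - j) ** 2 / (n_cats - 1) ** 2  # quadratic disagreement weight
            expected = (pa[i + min_score] / n) * (pb[j + min_score] / n)
            num += w * obs[i][j]
            den += w * expected
    return 1.0 - num / den

def exact_agreement(a, b):
    """Proportion of responses where the two scores match exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def standardized_mean_difference(engine, human):
    """SMD of engine minus human scores, in pooled-SD units (conventions vary)."""
    me, mh = sum(engine) / len(engine), sum(human) / len(human)
    ve = sum((x - me) ** 2 for x in engine) / (len(engine) - 1)
    vh = sum((x - mh) ** 2 for x in human) / (len(human) - 1)
    return (me - mh) / math.sqrt((ve + vh) / 2)

# Hypothetical human and engine scores on a 0-3 rubric.
human = [0, 1, 2, 3, 1, 2, 2, 3, 0, 1]
engine = [0, 1, 2, 3, 2, 2, 1, 3, 0, 1]
print(exact_agreement(human, engine))                       # 0.8
print(round(quadratic_weighted_kappa(human, engine, 0, 3), 3))  # 0.905
print(standardized_mean_difference(engine, human))          # 0.0
```

In subgroup analyses these metrics are typically computed separately for each subgroup (or for matched samples) and then compared against agreed-upon flagging thresholds.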
The results indicated that, across methods, the engine showed little evidence of bias for most subgroups in the full sample but more evidence of bias when groups were matched. It is not yet clear what contributed to this difference in performance when ability was controlled. The study also showed that human raters themselves had difficulty scoring responses in some regions of the score scale, and that the engine replicated that difficulty.
Download the White Paper