Comparing the Robustness of Automated
One frequent criticism of automated essay scoring is that engines do not understand language and can therefore be ‘tricked’ into giving higher scores than a response deserves. Engines have been found to be susceptible to such gaming responses, but the impact of these responses varies by item and engine design. Additionally, the public can view automated essay scoring negatively, in part because of how engines identify and score unusual responses. As a result, almost every operational automated essay scoring engine uses filters to identify aberrant responses, either flagging them as such or routing them for human review and scoring.
The state of the art in machine learning scoring has evolved in recent years, achieving gains in accuracy on a number of predictive tasks. Older models used feature-based approaches, in which experts wrote algorithms to create features thought relevant to item scoring and scores were predicted from weights applied to those features. Newer approaches learn features alongside the predictive model using very large, multi-layered neural networks (often called deep learning). Importantly, these models are designed to include sequence (i.e., word order) in the modelling process and are therefore thought to model language better than bag-of-words methods. One potential promise of deep learning models is that, because they consider word use in context, they are more robust to gaming behaviors and may therefore require fewer filters or none at all.
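The difference between an order-blind and an order-aware representation can be illustrated with a minimal sketch. The function names `bag_of_words` and `bigrams` and the two example sentences are hypothetical, chosen only for illustration; neither is taken from any scoring engine, and word bigrams stand in here as a very simple order-sensitive representation, not as a model of how a neural network encodes sequence.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count word occurrences, discarding all word-order information."""
    return Counter(text.lower().split())


def bigrams(text: str) -> list:
    """Return adjacent word pairs, a simple order-sensitive representation."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


# Two sentences with the same words but different meanings (illustrative only).
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# The bag-of-words representations are identical...
same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)

# ...while the order-sensitive representations differ.
same_bigrams = bigrams(sentence_a) == bigrams(sentence_b)
```

Under these assumptions, a model that sees only word counts cannot distinguish the two sentences, whereas any representation that retains sequence can.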
Given these developments, this study compared the robustness of one deep learning method with that of a traditional automated scoring method on a set of gaming responses. The deep learning method used a model called BERT (Bidirectional Encoder Representations from Transformers), which was initially trained to predict a masked word and to predict whether one sentence followed another. BERT was fine-tuned on sets of essays in order to classify, or predict, rubric-based scores. The traditional, or classical, automated scoring method was Cambium Assessment’s Autoscore, which combines expert-designed features to assess writing quality with Latent Semantic Analysis (LSA) to assess concepts in essays. LSA is a bag-of-words methodology that examines the relationship between word usage and essays by statistically generating topics based upon document and word frequency. Notably, LSA does not model word order. The gaming responses considered were: shuffled text (to examine the assumption that word order is learned by neural network models); grammatically correct but nonsensical essays (to examine the extent to which grammar contributes to the score); off-topic essays (to examine the extent to which essay meaning contributes to the score); and, finally, duplicated essays (to examine the extent to which length contributes to the score, controlling for meaning).
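Two of these gaming responses, shuffled text and duplicated essays, are simple enough to sketch in code. The following is a minimal illustration, not the procedure used in the study: the function names `shuffle_words` and `duplicate_essay`, the whitespace tokenization, the fixed seed, and the sample essay are all assumptions made for the example.

```python
import random


def shuffle_words(essay: str, seed: int = 0) -> str:
    """Return the essay with its words in random order.

    Assumes whitespace tokenization; the seed makes the shuffle repeatable.
    """
    words = essay.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


def duplicate_essay(essay: str, copies: int = 2) -> str:
    """Return the essay repeated `copies` times, which adds length
    without adding any new meaning."""
    return " ".join([essay.strip()] * copies)


# Hypothetical sample response, for illustration only.
essay = "The author argues that recycling saves energy and reduces waste."

shuffled = shuffle_words(essay)   # same words, word order destroyed
doubled = duplicate_essay(essay)  # same meaning, twice the length
```

A shuffled response keeps exactly the same word counts as the original, so a purely bag-of-words scorer would be expected to see it as unchanged, while a duplicated response isolates the effect of length on the score.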