We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, these techniques may make it incredibly easy to perform flawed assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that uses it (e.g., that correlates against assessment scores).
To Test a Test Suite
The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.
In this group, we take the perspective that test suites are classifiers of implementations. You give a test suite an implementation, and it either accepts or rejects it. Therefore, to measure the quality of a test suite, we can standard metrics for classifiers, true positive rate and true negative rate. However, to actually do this, we need a set of implementations that we know, a priori, to be correct or faulty.
|Accept||True Negative||False Negative|
|Reject||False Positive||True Positive|
A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Additionally, we can leverage all of the implementations that students are submitting—we just need to determine which are correct and which are faulty.
There are basically two ways of doing this in the literature; let’s see how they fare.
The Axiomatic Model
In the first method, the instructor writes a test suite and whatever that test suite’s judgments is used as the ground truth; e.g., if the instructor test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.
The Algorithmic Model
The second method does this by taking every test suite you have (i.e., both the instructor’s and the students’), running them all against a known-correct implementation, and gluing all the ones that pass it into one big mega test suite that is used to establish ground truth.
A Tale of Two Assessments
We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rate for each student’s test suite.
The choice of underlying implementation classification model substantially impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):
The Axiomatic Model:
Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.
The Algorithmic Model:
Judging by this plot, students performed astoundingly poorly at detecting buggy implementations, but quite well at identifying correct ones.
Towards Robust Assessments
So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.
We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of every assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.