Saturday, March 9, 2013

Predicting State Test Scores: Folly or Sensibility?

Can teachers precisely and accurately predict student scores on state standardized tests?

Last week, teachers at our school were asked to predict our students' state test scores. I had extreme difficulty engaging in this task because I felt it was speculative, fraught with error, and ultimately without statistical merit. The task continues to bother me, and I need to explore my unsettled thoughts and feelings more deeply.
Science Education vs. State Testing?

Predicting how students will score on a state-level standardized test feels mostly like guesswork; in fact, the task was described as one of "gut feelings" and "guesses." I have no qualms about making educated predictions; scientists do this all the time. But I worry that this year's mostly random guesses will be used for evaluative purposes at some later date: "Why were your predictions wrong?"

To progress from random guesses to educated hypotheses to informed decisions requires controlled methodology, meticulous experimentation, and detailed data collection in order for conclusions to have validity. The peril lies in jumping from guesses to conclusions, which I fear is where we are headed.

In our current system, few people in the educational realm—administrators and teachers alike, let alone students and parents—completely understand how the sausage is made when it comes to creating standardized tests. Our state provides technical papers on its website that explain some of the behind-the-scenes details on how state tests are created and scored. These papers are long and dense, and many of the methodologies described within the papers require an advanced understanding of statistics. Nevertheless, having reviewed the papers, I gleaned some valuable (if disturbing) insight into the sausage-making process.

Our state establishes four categories of performance on standardized tests: advanced, proficient, partially proficient, and unsatisfactory. These four categories are delineated by cut scores on the various tests that students take each year, which include writing, reading, math, and science. The cut scores are established through a process that estimates the probability of a student at a given performance level correctly answering a particular question on the state test. For example, a test question is considered a "proficient question" if a student performing at the proficient level has a 2/3 probability of answering that question correctly. Each test is a mixture of questions that fall across the spectrum of cut score categories.
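To make the idea concrete, here is a minimal sketch, in Python, of how a single test item might be sorted into a cut-score category under a 2/3-probability criterion. Everything in it (the level ordering, the threshold constant, the classify_item function) is my own guess at a simplified version of the process; the technical papers describe something far more elaborate.

    # Toy illustration of the "2/3 probability" idea described above. The
    # level ordering, threshold, and classify_item() function are my own
    # simplifications, not the state's actual standard-setting algorithm.

    RESPONSE_PROBABILITY = 2 / 3  # criterion for an item to "belong" to a level

    def classify_item(p_correct_by_level):
        """Return the lowest performance level at which a typical student
        has at least a 2/3 chance of answering the item correctly."""
        ordering = ["unsatisfactory", "partially proficient",
                    "proficient", "advanced"]
        for level in ordering:  # lowest ability level first
            if p_correct_by_level.get(level, 0) >= RESPONSE_PROBABILITY:
                return level
        return "harder than any cut score"

    # Example item: a proficient-level student has a 0.70 chance of getting
    # it right, an advanced-level student a 0.90 chance.
    item = {"unsatisfactory": 0.20, "partially proficient": 0.45,
            "proficient": 0.70, "advanced": 0.90}
    print(classify_item(item))  # -> proficient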

State tests consist mainly of two types of questions: selected response (i.e., multiple choice) and constructed response (i.e., written sentences and paragraphs). This model exists because both question types are relatively easy to score and analyze statistically. Selected response questions are either right or wrong, and constructed response questions are scored on a rubric-based scale via keyword analysis. Selected response questions are limited in their ability to assess higher-level skills such as critical thinking and problem-solving. Constructed response questions are limited to matching student responses against key words and phrases found in the scoring rubric; creativity and originality are not part of the equation.

The combination of question response type and cut scores allows for the "best" statistical analysis of student performance on state standardized tests. These details have been tweaked over the years to a level of optimization that permits the state to categorize students as advanced, proficient, partially proficient, or unsatisfactory. The very way the tests are constructed prevents all students from landing in a single category: if all students score unsatisfactory, the test is too hard; if all students score advanced, the test is too easy. Thus, the tests themselves are built so that there will always be a distribution (or variance) of scores across all four categories. Proficiency exists in a realm of endless statistical manipulation in which there will always be winners and losers, and the game will never end.

Which brings us back to the question of whether teachers can predict test scores…

If we assume that teachers can predict (guess) test scores, how valid and reliable are those predictions?

Recall that student performance on the state test is sorted into four categories: advanced, proficient, partially proficient, and unsatisfactory. What are the probabilities that a student will fall into one of those categories? In our school, very few students fall into the unsatisfactory category: our students are generally good writers and readers, and since state tests are reading- and writing-based, most students will be able to decode and respond to the test questions fairly well. Those students who score unsatisfactory tend to fall into the following groups: special education, English language learners, and intentional non-learners. If you are unable to read or write in English because of learning disabilities, language barriers, or complete apathy, then you will probably not score highly on the test. Because these conditions apply to a relatively small group of students at our school, we can predict that most of our students will score advanced, proficient, or partially proficient. I, then, have roughly a one-in-three chance of correctly guessing any given student's category. To be safe, I will tend to classify each student as "proficient" unless I have a solid feeling or reason for choosing advanced or partially proficient. There is very little risk in my predictions, which makes them feel little better than guesswork.
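A quick simulation shows just how low-risk that default strategy is. The base rates below are numbers I invented to resemble a school like ours; none of this comes from real test data.

    import random

    # Hypothetical category base rates -- invented for illustration only.
    BASE_RATES = {"advanced": 0.30, "proficient": 0.45,
                  "partially proficient": 0.20, "unsatisfactory": 0.05}

    def simulate(n_students=10_000, seed=1):
        rng = random.Random(seed)
        levels = list(BASE_RATES)
        actual = rng.choices(levels, weights=list(BASE_RATES.values()),
                             k=n_students)

        # Strategy 1: the low-risk default -- predict "proficient" for everyone.
        always_proficient = sum(a == "proficient" for a in actual) / n_students

        # Strategy 2: guess uniformly among the three likely categories.
        guesses = rng.choices(levels[:3], k=n_students)
        uniform_guess = sum(g == a for g, a in zip(guesses, actual)) / n_students

        return always_proficient, uniform_guess

    print(simulate())  # roughly (0.45, 0.32) with these made-up base rates

With those invented base rates, always predicting "proficient" is "correct" about 45% of the time and uniform guessing lands near 32%, and neither number reflects any insight into an individual student.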

If I wanted to make better guesses (more educated hypotheses), I would need to fully understand all of the variables involved in standardized testing. What are the variables that determine whether a student is advanced, proficient, partially proficient, or unsatisfactory? I noted above that students who score unsatisfactory may do so because of many different kinds of barriers. By a similar argument, students who score advanced probably face far fewer barriers to their performance. Can we quantify the myriad variables and barriers that affect each and every student's performance on a single set of tests given once per year in a highly artificial testing environment? (um, no...) How, then, can we predict, with accuracy and fidelity, how students will score on these tests?

If we accept that our test score predictions are nothing better than guesses, then what validity do they have at all? One of the rationales posited was that the prediction exercise improves inter-rater reliability, the measure of how consistently different raters assess and score the same work. Improving inter-rater reliability, however, requires knowledge of the test itself, and here we encounter another large barrier. Our state has deemed (rightly so) that it is unethical to "teach to the test" and has prohibited access to and use of testing materials during the school year. It is not possible to improve inter-rater reliability in a vacuum; without data to work with, our best-intentioned predictions are still merely guesses. At most, our year-to-year predictions can be considered "persistent."
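For what it's worth, inter-rater reliability is something that can actually be computed once two raters have scored the same work. Cohen's kappa is one standard statistic: it measures agreement beyond what chance alone would produce. The sketch below uses hypothetical predictions from two imaginary teachers, because that is all we have; A, P, and PP stand for advanced, proficient, and partially proficient.

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa: agreement between two raters beyond chance.
        rater_a and rater_b are equal-length lists of category labels."""
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                       for c in set(freq_a) | set(freq_b))
        return (observed - expected) / (1 - expected)

    # Hypothetical predictions from two teachers for the same ten students.
    teacher_1 = ["P", "P", "A", "PP", "P", "P", "A", "P", "PP", "P"]
    teacher_2 = ["P", "A", "A", "P",  "P", "P", "A", "P", "PP", "P"]
    print(round(cohens_kappa(teacher_1, teacher_2), 2))  # 0.64

The larger point stands: reliability is computed from shared data, and without access to actual scored tests there is nothing to compute.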

In meteorology, scientists rely on a wealth of data in their attempts to make precise and accurate weather predictions. At the lowest level of weather forecasting is persistence, the notion that tomorrow's weather will be the same as today's weather. This type of forecasting is valid only if weather conditions don't change in that time period. If any variable changes, then a persistence forecast is extremely poor and unreliable; in fact, it will probably "bust". Beginning meteorology students learn quickly that persistence forecasting is highly unscientific and that accurate weather forecasting relies on deep understanding of how all weather variables are interacting and evolving throughout atmospheric time and space.
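In code, a persistence forecast is almost embarrassingly simple, which is rather the point. The daily conditions below are invented for the example.

    # Persistence forecast: predict that tomorrow looks exactly like today.
    days = ["sunny", "sunny", "rain", "rain", "sunny", "snow", "snow", "sunny"]

    forecasts = days[:-1]   # today's condition, reused as tomorrow's forecast
    actuals = days[1:]      # what actually happened the next day
    hits = sum(f == a for f, a in zip(forecasts, actuals))
    print(f"persistence accuracy: {hits}/{len(actuals)}")  # 3/7 here

It verifies only on the days when nothing changes, which is precisely the weakness of treating last year's results as this year's prediction.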

When we are ignorant of the variables that affect a student's performance on a state test, I feel that our attempts to predict a student's future performance lack sensibility and are at best folly…