The Exam Statistics: The Difficulty

In my last post, I discussed how I analyze the mean in my (multiple choice) exams. This time, I'm going to look at difficulty. This is not directly related to the mean. Huh? Isn't it the case that, the more difficult the exam, the lower the mean? Well, yes. But that's not the "difficulty" I'm writing about.

Among all the pages and pages of results I get from Test Scoring & Questionnaire Services is the "DIF" score or difficulty of each question. It's actually the proportion of the class who answered that question correctly. DIF=1.000 means that everyone got it right, but DIF=0.250 means that only 25% of the class did. But it's not really "difficulty," is it? If a question is really difficult, fewer people will answer it correctly and the number should decrease. So, really, it shouldn't be called difficulty, it should be called easiness. But, look, it's just called "difficulty," OK?

You might be thinking that I want everyone to answer every question correctly, right? Um, sorry to rain on your ice cream, It's really, really unlikely that everyone was able to learn absolutely everything in the course, and was also able to remember and apply that knowledge on an exam perfectly correctly for every question. What an exam should do is assess each student's learning of the material, and provide some way of differentiating among all students. If all questions are answered correctly, the exam itself has failed.

I went to a seminar last year at which a renowned expert in testing and exam question construction gave a talk. After it was over, I talked to him about DIF scores--specifically, what should they be? The general rule is that an exam question is doing a good job of differentiating among students if it's at least 0.300. That is, at least 30% of the class should be getting each question correct. There is no guideline for the upper end, but at another seminar, I heard an instructor say that she liked to put at least one DIF=1.000 question on each exam as a confidence booster. Yup, a gimme. I thought that was a pretty nice thing to do, so I try to include at least one high DIF question on every one of my exams, too.

So difficulty is related to the mean in that, the higher the DIF, the higher the mean on the exam overall. The mean is good for evaluating the overall performance of the class. But I also need to evaluate the questions on my exams, so I get the DIF score for each one. If the DIF is too low, the question either gets killed (*snff*), or rewritten to clarify it. Oh, and if I ever get DIF=0.000, it means I've keyed in an incorrect answer. Ooops.

Why aren't you studying?


Anastasia said...

"What an exam should do is assess each student's learning of the material, and provide some way of differentiating among all students."

Yes, the function of university is as much to sort people as it is to teach them...

Find It