Testing and Grading

January 29, 2013

(This article first appeared in the Graduate Teacher Program Handbook. Copyright © 1988 by the Board of Regents, University of Colorado.)

Philip Langer, Professor and Chair
Educational Psychological Studies


The topics involved in evaluation are so broad that I can address only a few critical items. These include factors that impinge on the reliability and validity of a test, as well as the scoring of papers, projects, and other essay-like products.

The reliability of a test is, for purposes of evaluation, the stability of the test score, or more basically, the standard error of measurement. Obviously, the larger the error of measurement, the less trust you can place in the score as a viable measure of student performance. Probably the best technique is to increase the number of items to the point where the stability is acceptable to you. Obviously there are limits: you can make the test so long that you are either including trivia, or the students cannot finish in time. We will deal with these problems below.
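For readers who want the numbers behind "stability," the standard psychometric expressions are worth having at hand; these are textbook formulas, not anything specific to this handbook. Here sigma_x is the standard deviation of observed scores and r_xx is the test's reliability coefficient:

    % Standard error of measurement: the expected spread of a student's
    % observed score around his or her "true" score.
    SEM = \sigma_x \sqrt{1 - r_{xx}}

    % Spearman-Brown: lengthening a test by a factor of k raises the
    % reliability from r_xx to r_k, which is why adding items helps.
    r_k = \frac{k\, r_{xx}}{1 + (k - 1)\, r_{xx}}

A test with reliability .70, lengthened by half again (k = 1.5), moves to roughly .78, so modest additions buy real stability.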

Validity, or what the test measures, is commonly judged on the basis of the test content itself. I would like to discuss several faulty test construction techniques which diminish validity. I will confine my remarks to the construction of multiple-choice items since these are the most frequently used.

In addition, I will present a model for reliably evaluating products such as essays, projects, and the like.

Should I use "canned" tests?
Most text publishers for survey courses such as General Psychology generally include test files as part of the package. Instructors use them because they are time savers. These vary from good to awful; frequently they have never been tested out in class. The key problem is that if you stick strictly to the test file, you will not include content added specifically in lecture. If students catch on that the tests are tied solely to the book, the additional materials you bring in will be ignored.

What then should the test include?
It follows that both lecture (or discussion) and text are legitimate sources of test items. I would not hesitate to use good items from the publisher file, but I would also be certain to use questions clearly derived from additional lecture or discussion materials.

How long should the test be?
As a rule of thumb I assume that the average student can complete one multiple-choice item per minute. If there are 50 minutes available, I give 50 items. This usually allows the student to check answers. In addition, a test of 50 items has adequate reliability.

What are some precautions to take in constructing (or selecting) multiple-choice items?
As one might guess, some of the comments made above apply here. One of the most common mistakes made in constructing multiple-choice items is to develop the question as basically a fill-in derived from the text. For example, you come across the sentence: "An example of a variable-ratio schedule is gambling." The multiple-choice item becomes: "Gambling is an example of a: (a) fixed-ratio, (b) fixed-interval, (c) variable-ratio, or (d) variable-interval schedule." Stay as far away from such construction strategies as you can, since the test's validity becomes highly correlated with recall of the precise text wording. I can remember one test file derived solely in such a manner.

Second, be sure you include at least several items which require more complex processing. These are difficult to construct, and no one expects you to write all items at this level, but surely it is possible to have a student recognize that an alarm clock is a CS (conditioned stimulus), instead of merely recalling that it was a metronome Pavlov used. I could give other examples, but the intent is to show the importance of using items that go beyond the simple factual level.

Third, watch the wording of items. That has two implications for you. First, the smart (testwise) student goes through the entire exam before beginning to respond. All too often the answer to one question may be found in the stem of another. Thus the student may not recall Pavlov in connection with classical conditioning, but may find an item which starts out by identifying Pavlov with the critical experiment as described in the text. Next, if the question stem ends in "a," and all the choices except one begin with a vowel, it does not take a genius to figure out the correct answer. The best way to avoid the problem is to use "a(n)." The same thing is true for plurals, i.e., use "is (are)."

What else?
I'll let you in on a secret. If you haven't got a clue as to the correct answer on a multiple-choice test, choose the longest response; the teacher, wanting to be able to defend the correct answer, tends to make it the most complete. In short, watch the length of your responses so it does not become a clue.

Next, avoid as much as possible test items that have as a response choice "all of the above." This allows the student who knows only part of the answer to guess the rest accurately, thereby diminishing validity. For example, if there is a choice consisting of "all of the above," and the student knows that at least two of the answers are correct, then even if the third is unknown, "all of the above" is chosen and is correct. I also tend to avoid "none of the above" as a choice, since it involves the process of elimination; I would rather the students show me they know the correct response.

And finally, teachers fall into the trap of favoring certain positions for the correct choices. Students sometimes pick up on the fact that it is either the first or last choice, or somewhere in the middle. To avoid the problem, I use a set of 3 x 5 index cards labeled 1, 2, 3, and 4. After I construct the choices, I shuffle the deck and place the correct answer in whatever position comes up. If "2" comes up four times in a row, so be it.
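If you assemble exams electronically rather than with index cards, the same shuffle takes a few lines of code. Here is a minimal Python sketch of the idea; the function name and item format are illustrative, not part of any particular testing package:

    import random

    def shuffle_choices(stem, correct, distractors):
        """Return the item text and answer key, with the correct answer
        placed at a random position (the shuffled-card method)."""
        choices = [correct] + list(distractors)
        random.shuffle(choices)  # every position is equally likely
        answer_key = "abcd"[choices.index(correct)]
        lines = [stem] + ["  (%s) %s" % ("abcd"[i], c)
                          for i, c in enumerate(choices)]
        return "\n".join(lines), answer_key

    item, key = shuffle_choices(
        "Gambling is an example of a:",
        "variable-ratio schedule",
        ["fixed-ratio schedule", "fixed-interval schedule",
         "variable-interval schedule"],
    )
    print(item)
    print("Answer:", key)

Like the cards, this makes no attempt to balance positions; over enough items, randomness does the balancing for you.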

What about marking essays, projects, etc.?
Every study I have ever seen points out that the scoring of essays (and similar open-ended products) tends to be very unreliable. What I would like to do is give you a set of procedures for increasing reliability.

First of all, construct the model answer. If the content is highly diversified (as in the case of a project) be sure to include standards of performance or adequacy. This model response consists of what you are prepared to accept.

Second, select from the top, middle, and bottom of the pile about a half-dozen papers (or projects). This gives you a preview of the actual range of performance and avoids sampling problems.

Third, if there are several parts such as a series of essays, read only one question at a time. This eliminates the "halo effect." That is, it might happen that the first response is so good (or bad) it biases your judgment regarding other parts.

Fourth, make written comments (good or bad) as you read. This forces you to concentrate throughout and not fall into the trap of skipping along looking for key words or concepts. Also, if the student questions your judgment you won't have to rack your brain trying to recall precisely what did lead to the decision. Students should be responsible for everything they include.

Fifth, you will find yourself arranging the papers in one of several piles, based on your model. Initially there are usually three: below average, average, and above average. Do not assign a score. Now start reading, then place each paper in one of the piles, entering comments as you did before. Again, no values.

Sixth, if there are more than 10-15 papers, stop and reread through the piles. This is to check whether your frame of reference has shifted and whether you are still being consistent.

Seventh, after you have finished (and you may have more than the initial three piles), assign a uniform value to each paper in a given pile. I have found that I can more easily defend scores if those within a given pile have the same value, and are separated by several points from the next pile. Hence, giving Pile A 20 points and Pile B 16 points is a lot easier to justify than "why did he/she get 18 points and I got 17?"
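In code, the pile-then-score idea amounts to a simple lookup. The sketch below is hypothetical; the pile labels and point values are mine, chosen only to show the several-point gaps:

    # Hypothetical point values, with gaps of several points between piles,
    # so every paper in a pile receives the same, easily defended score.
    PILE_POINTS = {"above average": 20, "average": 16, "below average": 12}

    def score_papers(piles):
        """piles maps a pile label to the list of papers placed in it."""
        return {paper: PILE_POINTS[label]
                for label, papers in piles.items()
                for paper in papers}

    scores = score_papers({
        "above average": ["Smith", "Jones"],
        "average": ["Brown"],
        "below average": ["Green"],
    })
    # {'Smith': 20, 'Jones': 20, 'Brown': 16, 'Green': 12}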

And yes, that is precisely the way I do it, which is why Psychology 100 is based strictly on objective testing. If you don't want to read essays, don't give them.

When is a test fair?
Most students consider a test fair if it deals with what they were led to believe would be covered, based on the text, lectures, and discussions. That is why, in preparing the test, I try to give at least some emphasis to topics presented in lecture. That does not mean I cannot include items not discussed in lecture, but clearly, in preparing the test, one must keep a sense of perspective. Students who feel that effort and achievement are related, allowing for the differences in ability that they do recognize, are more likely to persist in giving it their best shot.

What do you do with failing students who come to you and want help?
My standard procedure is not to pass on little homilies, but to treat the matter as the student's problem. What I do is hand students a copy of the test with the answer sheet, and have them go through it. Then I ask them to describe the errors they made. More often than not, they will tell me that they did not read the text carefully enough, or look through the choices, or pay enough attention in lecture, or that they tried to do too much the night before. If they still insist they did all the right things, then I sometimes tell them the competition is tougher; the peers who made them look good in high school are no longer there. Finally, if all else fails, I simply point out that they may have to spend more time studying. Unless you have reason to suspect their study strategies are horribly inefficient, I would work with what they do best in terms of study habits.
