Item Analysis

We at the Testing and Assessment Center want to be your partner in Test Item Analysis!

Item Analysis is an assessment service we offer in conjunction with test scanning: a question-by-question (item-by-item) analysis of your test, intended as a teaching tool.

Here’s how it works:

Besides an analysis of scores across the entire class, the class is also separated into two halves on the basis of total score. We will call these groups the high and low scorers. If several people share the median score, they are assigned unsystematically to the two halves. The assumption is that the total score is the best indication of overall mastery of the material; this will be true for all but the worst tests. If this were a placement test, the grade in the target class might be used for the analysis instead of the test score.
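The split described above can be sketched in a few lines of Python. The function name and data layout here are our own illustration, not the report's actual implementation:

```python
import random

def split_halves(scores, seed=0):
    """Split {student: total_score} into high and low halves by total score.

    Students tied at the median cut fall into either half arbitrarily,
    mirroring the unsystematic assignment described above.
    """
    rng = random.Random(seed)
    students = list(scores)
    # Shuffle first, then stable-sort by score, so tied students
    # end up ordered randomly within their score group.
    rng.shuffle(students)
    students.sort(key=lambda s: scores[s], reverse=True)
    cut = len(students) // 2
    return students[:cut], students[cut:]  # (high scorers, low scorers)

scores = {"ann": 18, "bob": 15, "cam": 15, "dee": 12, "eli": 10}
high, low = split_halves(scores)
# "ann" always lands in the high half; "dee" and "eli" in the low half.
```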

The report of this analysis can be broken into three parts, which run across the page from left to right:

  1. Response Frequency

    For each question, the report gives the number of people responding with each option or with no option (blank). For example, suppose that on question 1, 18 people answered "A" and one answered "B." An asterisk next to the 18 indicates that the master key listed "A" as the correct answer. If "9/9" appears below the 18, it indicates that 9 of the 9 people in the half of the class with the highest total scores answered "A," and 9 of the 10 people in the lower half of the class answered "A."
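The tally behind that example can be sketched as follows (the function and data layout are our own illustration, not the report's code):

```python
from collections import Counter

def response_frequency(answers_high, answers_low):
    """Tally one question's responses, split by class half.

    answers_high / answers_low: the option each student in that half
    chose ("" for blank). Returns {option: (total, n_high, n_low)}.
    """
    hi, lo = Counter(answers_high), Counter(answers_low)
    return {opt: (hi[opt] + lo[opt], hi[opt], lo[opt])
            for opt in sorted(set(hi) | set(lo))}

# The example above: a class of 19, with 9 in the upper half and 10 in
# the lower half; 18 answered "A" (9 high, 9 low) and 1 answered "B".
table = response_frequency(["A"] * 9, ["A"] * 9 + ["B"])
# table["A"] == (18, 9, 9) and table["B"] == (1, 0, 1)
```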

    Two patterns of overall responding may indicate that the question deserves special attention:

    • If many of the students gave an answer not indicated by the key.
    • If many of the students did not answer the question correctly, but the incorrect answers were distributed among several options.

    In either case, the question should be checked to ensure that the key was accurate, and reviewed carefully to be sure that it was not misleading or ambiguous. If the question is determined to be ambiguous, it could be rewritten and tried again on a future test or with a future class. If the key and wording are sound and the question was intended as a baseline or to forecast new material, the result indicates that there is a considerable lack of knowledge to overcome. If the question was intended to show mastery of the material, the indication is that the methods of presenting the topic in class and in the text need to be reviewed.

    If the purpose of the question was to obtain a baseline of knowledge of a topic and most students answer it correctly, it may indicate that the topic does not need much further coverage.

    If the purpose of the question was to emphasize a topic or reinforce mastery, then one might expect almost all students to get the right answer. If the purpose of the question was to discriminate between students on mastery of the material covered, several patterns of responding need to be given a closer look. If almost all of the students get the question right or most of them get it wrong, the question is not a good discriminator. But no matter what proportion of the students got the question right, if the same proportion (or number) in the top half of the class and the bottom half of the class got it right, it failed to discriminate. In other words, it was not one of the items that led to the differences in total scores. We will go into this further in the section on discrimination analysis.

  2. Difficulty Analysis

    This is a summary of the information on the overall frequency analysis. The NO. RIGHT and WRONG are the number of all students getting the question right and wrong respectively. The PCT RIGHT and WRONG are the percentage of all students getting the question right and wrong respectively. Clearly the number of students will vary with class size while the percentage should remain relatively constant across classes. Besides being a quick way to select questions that need further attention, the percent right (or wrong) can be used to adjust overall scores and to balance sections of a future test.
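The difficulty columns reduce to simple counts and percentages. A minimal sketch (the dictionary keys follow the report's column headings; the function itself is our own):

```python
def difficulty(responses, key):
    """NO. RIGHT/WRONG and PCT RIGHT/WRONG for one question.

    responses: the option every student chose ("" for blank).
    """
    n = len(responses)
    n_right = sum(r == key for r in responses)
    return {"NO. RIGHT": n_right,
            "NO. WRONG": n - n_right,
            "PCT RIGHT": 100 * n_right / n,
            "PCT WRONG": 100 * (n - n_right) / n}

# 15 of 20 students right: PCT RIGHT is 75.0
stats = difficulty(["A"] * 15 + ["B", "C", "B", "", "D"], key="A")
```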

    For example, if the average percent right of all questions on a test is 75, then the average score on that test could be expected to be 75%. Further, if a section of questions has the same percent right as those in another section weighted twice as much, the latter section would be expected to contribute twice as much per question to the total score. Conversely, if the section weighted twice as much has an average percent right half that of the other section, both sections will contribute equally per question to the total score. Either method may be valid. The first method represents giving more credit for "more important" material. The second method represents adjusting more difficult questions to give equal credit.
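The weighting claim can be checked with one line of arithmetic; the percentages below are illustrative numbers of our own:

```python
def expected_points_per_question(weight, pct_right):
    """Expected contribution of one question to the average total score."""
    return weight * pct_right / 100

# A normal-weight section at 80% right and a double-weight section at
# 40% right contribute equally per question, as claimed above.
a = expected_points_per_question(1, 80)  # 0.8 points per question
b = expected_points_per_question(2, 40)  # 0.8 points per question
```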

  3. Discrimination Analysis

    This section of the analysis summarizes the information derived by dividing the class into halves based on total score. The five measures reported here typically vary together. In a formal analysis, one or another of these measures may be preferred based on the type of question, the size of the sample, or the target measure if it differs from the total test score. In analyzing classroom tests, the choice is relatively arbitrary. All the measures reflect how much better the high scorers did on the item than the low scorers. Another way of saying this is that they indicate how much a question contributed to the differences in total score. For all five, if a larger percentage of the lower half of the class got the question right than the upper half, the number is negative.

    HL DIF (high-low difference) is simply the difference between the number of students in the top half of the class who got the question right and the number in the lower half who got it right. It has a theoretical range from minus half the class size to plus half the class size. Half the class size indicates that everyone in the upper half of the class got the question right while no one in the lower half did. Zero indicates that the same number in the upper and lower half of the class got it right.

    PCT DISCRIM is the high-low difference divided by half the class size. It ranges from -100 to 100. 100 indicates that everyone in the upper half of the class got the question right while no one in the lower half did. Zero indicates that the same number in the upper and lower half of the class got it right.
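Both of these measures reduce to a line of arithmetic each; the class sizes and counts below are illustrative:

```python
def hl_dif(n_high_right, n_low_right):
    """HL DIF: high-half correct minus low-half correct."""
    return n_high_right - n_low_right

def pct_discrim(n_high_right, n_low_right, half_size):
    """PCT DISCRIM: HL DIF as a percentage of half the class size."""
    return 100 * hl_dif(n_high_right, n_low_right) / half_size

# A class of 20, so 10 students per half:
perfect = pct_discrim(10, 0, 10)  # 100.0: all high half right, no low half
none = pct_discrim(6, 6, 10)      # 0.0: same number right in each half
```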

    X SQ (Chi Square) is a measure relating the proportion of students in the upper half of the class getting the question right to the proportion in the lower half getting it right. Zero again indicates that equal numbers of upper- and lower-half students got the question right. The size of the number for any given degree of discrimination increases with the class size. This means that what counts as a "large" chi-square depends on class size, and chi-squares are difficult to compare across classes of different sizes.
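The class-size dependence is easy to see with one common 2x2 chi-square formula (without continuity correction; the report's exact variant may differ):

```python
def chi_square(high_right, high_wrong, low_right, low_wrong):
    """2x2 chi-square for right/wrong tallies by class half."""
    a, b, c, d = high_right, high_wrong, low_right, low_wrong
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

equal = chi_square(5, 5, 5, 5)    # 0.0: equal proportions in each half
small = chi_square(9, 1, 4, 6)    # a class of 20
large = chi_square(18, 2, 8, 12)  # same proportions, twice the class size
# large is exactly 2 * small: the statistic grows with class size
```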

    PHI, BIS R, and PT. BIS (Phi, Biserial R, and Point Biserial R) are all methods of adjusting a correlation for the situation in which only certain values of the measures are possible. Since they are related to correlation, they range from approximately -1 to 1. As with the other measures, zero indicates that the question did not discriminate between students in the upper and lower halves of the class. A 1 would indicate that everyone in the upper half of the class got the question right and no one in the lower half did.
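Point Biserial R, for instance, is simply the Pearson correlation between a 0/1 item score and the total score. A sketch, with illustrative data (not the report's exact computation):

```python
from statistics import mean, pstdev

def point_biserial(item, totals):
    """PT. BIS: correlation between a 0/1 item score and total test score."""
    mi, mt = mean(item), mean(totals)
    cov = mean((x - mi) * (y - mt) for x, y in zip(item, totals))
    return cov / (pstdev(item) * pstdev(totals))

# All three top scorers got the item right; all three bottom scorers
# got it wrong, so the correlation is close to +1.
item = [1, 1, 1, 0, 0, 0]
totals = [20, 18, 17, 12, 10, 9]
r = point_biserial(item, totals)
```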

    All of these measures apply to those questions meant to discriminate students on mastery of the material covered. A substantial negative score on any of them indicates that the question was measuring something very different from most of the rest of the test. (A question on mechanics embedded in a philosophy test might have this relation.) Questions with measures near zero failed to discriminate between students. Questions with "high" measures were the best at separating those who had mastered the material from those who had not.

    The next question is "What is high?" For HL DIF and X SQ, high depends on the number of students in the class. For all the measures, high depends on how many different topics and kinds of questions were included in the test. For our purposes it serves to take the highest-scoring questions from a test and add them (or variations of them) to a question file, to be used on subsequent tests as the discriminating items, alongside questions that have been modified based on this analysis and new questions. Note that in theory, if all your questions were perfect discriminators on these measures, half of your class would get a perfect score and the other half zeros, which would probably not be a satisfactory result.

    Of the statistics reported on the last page under Summary Analysis of the Test, we will consider only two here:

    The Correlation of Odd-Even Scores is the correlation between the scores of two tests, one made up of all odd items, the other made up of all even items, all items equally weighted. If this score is high (above .6), it is probable that a shorter test could be constructed that would be just as discriminating. If this score is low (below .3), it may be that you need more discriminating items to achieve an accurate measure.
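The odd-even split can be sketched as a plain Pearson correlation between the two half-test totals. The function and the tiny item matrix below are our own illustration:

```python
from statistics import mean, pstdev

def odd_even_correlation(item_matrix):
    """Correlate each student's total on odd items with the total on even items.

    item_matrix: one list of 0/1 item scores per student. Items are
    numbered from 1, so row[0::2] holds the odd-numbered items.
    """
    odd = [sum(row[0::2]) for row in item_matrix]
    even = [sum(row[1::2]) for row in item_matrix]
    mo, me = mean(odd), mean(even)
    cov = mean((o - mo) * (e - me) for o, e in zip(odd, even))
    return cov / (pstdev(odd) * pstdev(even))

rows = [[1, 1, 1, 1],   # one row of item scores per student
        [1, 1, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]
r = odd_even_correlation(rows)  # about 0.82, above the .6 threshold
```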

    Kuder-Richardson reliability is a measure of internal consistency, or homogeneity: how well the test measures a single factor. If this score is low, the measures of discrimination cannot be high. If this score is very low, it probably means that you are trying to test for too many different factors for the length of the test.
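For dichotomously scored items, the standard formula here is Kuder-Richardson 20. A sketch of that formula with illustrative data (the report may use a variant):

```python
from statistics import pstdev

def kr20(item_matrix):
    """Kuder-Richardson 20 from 0/1 item scores, one row per student.

    KR-20 = (k/(k-1)) * (1 - sum(p*q) / var(total)), where p is each
    item's proportion right, q = 1 - p, and var(total) is the
    population variance of the total scores.
    """
    k = len(item_matrix[0])
    n = len(item_matrix)
    var_total = pstdev(sum(row) for row in item_matrix) ** 2
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_matrix) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

rows = [[1, 1, 1, 1],   # one row of item scores per student
        [1, 1, 1, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0]]
rel = kr20(rows)  # about 0.87 for this tiny illustrative matrix
```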


A final caveat: The accuracy of any of these numbers for a particular administration of a test depends on the number of students in the class. For typical class sizes, a question may receive a measure substantially higher or lower than it should simply because of the accidental composition of the class. No question should be rejected without review of the question itself. If the question seems like it should serve the purpose for which it was intended but the numbers indicate that it did not, try it again. If the numbers still come out in the wrong direction, it may mean that you need to examine the reasons you believe the question should be working for you.

It is not necessary to keep all the factors mentioned here in mind at once or to check each of these measures against each question. Each question should be included to serve some purpose: to discriminate, to emphasize, to forecast, to obtain a baseline, to reinforce mastery, or to achieve a level of difficulty. It is not usually practical or even possible to make a question do all of these. Therefore a question needs to be checked against the measures that will indicate whether it achieved its purpose. It is, of course, possible to design a test meant only to discriminate. Placement tests are often in this class. However, in a classroom, such a test might not be the best use of a valuable teaching tool.