Item Analysis
We at the Testing and Assessment Center want to be your partner in Test Item
Analysis!
Item Analysis is a
teaching tool assessment service we offer in conjunction with test
scanning. It is a question-by-question (item-by-item) analysis of your
test.
Here’s how it works:
Besides an analysis of scores across the entire class, class scores are
also separated into two halves on the basis of total score. We will call
these groups high and low scorers. If there are several people at the
median score, they are assigned unsystematically to different halves.
The assumption is that the total score is the best indication of overall
mastery of the material. This will be true for all but the worst tests.
If this were a placement test, the grade in the target class might be
used for the analysis instead of test score.
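The split described above can be sketched as follows. This is a minimal illustration, not the Center's actual procedure: the function name `split_halves`, the `seed` parameter, and the choice to put the extra student of an odd-sized class in the lower half are all assumptions.

```python
import random

def split_halves(total_scores, seed=None):
    """Split student indices into (high, low) halves by total score.

    Students are shuffled before sorting, so those tied at the median
    land in either half unsystematically.
    """
    rng = random.Random(seed)
    order = list(range(len(total_scores)))
    rng.shuffle(order)  # randomize the order among tied scores
    # Python's sort is stable, so tied students keep their shuffled order.
    order.sort(key=lambda i: total_scores[i], reverse=True)
    n_high = len(order) // 2  # assumption: odd class gives the extra student to the low half
    return order[:n_high], order[n_high:]
```

With 19 students this yields an upper half of 9 and a lower half of 10.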
The report of this analysis can be broken into three parts, which run
across the page from left to right:
1. Response Frequency
For each question, the report shows
the number of people responding with each option or with no option
(blank). For example, suppose that on question 1, 18 people answered "A" and one
answered "B." An asterisk next to the 18 indicates that the master key
listed "A" as the correct answer. A "9/9" below the 18
indicates that 9 of the 9 people in the half of the class with the
highest total scores answered "A" to this question and 9 of the 10 in
the lower half of the class answered "A."
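The tabulation in this example can be reproduced with a short sketch. The function name `response_frequency` and the one-line text layout are hypothetical; the printed report's layout differs.

```python
from collections import Counter

def response_frequency(responses, high, low, key):
    """Tabulate one question's responses.

    responses: each student's answer ("A".."E", or "" for blank)
    high, low: student indices in the upper and lower halves
    key: the keyed correct answer
    Returns one formatted row per option that was chosen.
    """
    overall = Counter(responses)
    hi = Counter(responses[i] for i in high)
    lo = Counter(responses[i] for i in low)
    rows = []
    for option in sorted(overall):
        star = "*" if option == key else ""   # asterisk marks the keyed answer
        label = option or "blank"
        rows.append(f"{label}: {overall[option]}{star} ({hi[option]}/{lo[option]})")
    return rows

# The example from the text: 19 students, 18 answer "A", one answers "B".
responses = ["A"] * 18 + ["B"]
rows = response_frequency(responses, high=list(range(9)),
                          low=list(range(9, 19)), key="A")
```

Here `rows` holds "A: 18* (9/9)" and "B: 1 (0/1)".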
Two patterns of overall
responding may indicate that the question deserves special attention:
- If many of the students gave an answer not indicated by the key.
- If many of the students did not answer the question correctly, but the
  incorrect answers were distributed among several options.
In either case, the
question should be checked to ensure that the key was accurate and the
question reviewed carefully to be sure that it was not misleading or
ambiguous. If the question is determined to be ambiguous, it could be
rewritten and tried again in a future test or class. If this is not the
case and the question was intended as a baseline or to forecast new
material, it indicates that there is a considerable lack of knowledge to
overcome. If the question was intended to show mastery of the material,
the indication is that methods of presenting the topic in class and in
the text need to be reviewed.
If the purpose of the
question was to obtain a baseline of knowledge of a topic and most
students answer it correctly, it may indicate that the topic does not
need much further coverage.
If the purpose of the question was to emphasize a topic or reinforce
mastery, then one might expect almost all students to get the right
answer. If the purpose of the question was to discriminate between
students on mastery of the material covered, several patterns of
responding need to be given a closer look. If almost all of the students
get the question right or most of them get it wrong, the question is not
a good discriminator. But no matter what proportion of the students got
the question right, if the same proportion (or number) in the top half
of the class and the bottom half of the class got it right, it failed to
discriminate. In other words, it was not one of the items that led to
the differences in total scores. We will go into this further in the
section on discrimination analysis.
2. Difficulty Analysis
This is a summary of the information on the overall frequency analysis.
The NO. RIGHT and WRONG are the number of all students getting the
question right and wrong respectively. The PCT RIGHT and WRONG are the
percentage of all students getting the question right and wrong
respectively. Clearly the number of students will vary with class size
while the percentage should remain relatively constant across classes.
Besides being a quick way to select questions that need further
attention, the percent right (or wrong) can be used to adjust overall
scores and to balance sections of a future test.
For example, if the average percent right of all questions on a test is
75, then the average score on that test could be expected to be 75%.
Further, if a section of questions has the same percent right as those
in another section weighted twice as much, the latter section would be
expected to contribute twice as much per question to the total score.
Conversely, if the section weighted twice as much has an average percent
right of half the other section, both sections will contribute equally
per question to the total score. Either method may be valid. The first
method represents giving more credit for "more important" material. The
second method represents adjusting more difficult questions to give
equal credit.
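The difficulty figures and the weighting arithmetic above can be checked with a small sketch. The function names and the sample percentages (80 and 40) are illustrative assumptions, not values from any report.

```python
def difficulty(correct_flags):
    """Return (no_right, no_wrong, pct_right, pct_wrong) for one question.

    correct_flags holds 1 for each student who got the question right, else 0.
    """
    n = len(correct_flags)
    right = sum(correct_flags)
    wrong = n - right
    return right, wrong, 100 * right / n, 100 * wrong / n

def expected_points_per_question(pct_right, weight):
    """Average points one question contributes, given its percent right and weight."""
    return weight * pct_right / 100

# Same percent right, double weight: twice the contribution per question.
base = expected_points_per_question(80, weight=1)       # 0.8
important = expected_points_per_question(80, weight=2)  # 1.6

# Double weight, half the percent right: equal contribution per question.
harder = expected_points_per_question(40, weight=2)     # 0.8
```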
3. Discrimination Analysis
This section of the analysis summarizes the information derived by
dividing the class into halves based on total score. All of these
measures typically vary together. In a formal analysis, one or another
of these measures may be preferred based on the type of question, size
of sample, or target measure if different from total test score. In
analyzing classroom tests, the choice is relatively arbitrary. All the
measures reflect how much better the high scorers did on the item than
the low scorers. Another way of saying this is that they indicate how
much a question contributed to the differences in total score. For all
of them, if a larger percentage of the lower half of the class got the
question right than the upper half, the number is negative.
HL DIF (high-low difference) is simply the difference between the number
of students in the top half of the class who got the question right and
the number in the lower half who got it right. It has a theoretical
range from minus half the class size to plus half the class size. Half
the class size indicates that everyone in the upper half of the class
got the question right while no one in the lower half did. Zero
indicates that the same number in the upper and lower half of the class
got it right.
PCT DISCRIM is the high-low difference divided by half the class size.
It ranges from -100 to 100. 100 indicates that everyone in the upper
half of the class got the question right while no one in the lower half
did. Zero indicates that the same number in the upper and lower half of
the class got it right.
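Both measures follow directly from the definitions above. The sketch below assumes each student's result on the item is stored as 1 (right) or 0 (wrong); the function names are hypothetical.

```python
def hl_dif(item_correct, high, low):
    """Number right in the upper half minus number right in the lower half."""
    return sum(item_correct[i] for i in high) - sum(item_correct[i] for i in low)

def pct_discrim(item_correct, high, low):
    """High-low difference divided by half the class size, as a percentage."""
    half = len(item_correct) / 2
    return 100 * hl_dif(item_correct, high, low) / half

# Class of 20: 8 of the upper 10 and 3 of the lower 10 answer correctly.
high, low = list(range(10)), list(range(10, 20))
item = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 7
```

For this item, `hl_dif` gives 5 and `pct_discrim` gives 50.0.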
X SQ (Chi Square) is a measure relating the proportion of students in
the upper half of the class getting the question right to the proportion
in the lower half getting it right. Zero again indicates that equal
numbers of upper and lower half students got the question right. The
size of the number for any given degree of discrimination increases with
the class size. This means that a "large" chi-square depends on class
size and chi-squares are difficult to compare across classes of
different sizes.
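The dependence on class size is easy to see numerically. The sketch below uses the standard Pearson chi-square for a 2x2 table without a continuity correction; whether the report applies a correction is not stated, so treat that as an assumption. The statistic itself is never negative, so any negative value would reflect a sign attached by the report.

```python
def chi_square_2x2(high_right, high_wrong, low_right, low_wrong):
    """Pearson chi-square for a 2x2 table, without continuity correction."""
    a, b, c, d = high_right, high_wrong, low_right, low_wrong
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # a row or column total is zero: nothing to compare
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

small = chi_square_2x2(8, 2, 3, 7)    # class of 20: 80% vs 30% right
large = chi_square_2x2(16, 4, 6, 14)  # class of 40: same proportions
```

With identical proportions in each half, doubling the class size doubles the chi-square (about 5.05 versus about 10.10 here).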
PHI, BIS R, and PT. BIS (Phi, Biserial R, and Point Biserial R) are all
methods of adjusting correlation for the situation in which only certain
values of the measures are possible. Since they are related to
correlation, the range varies from approximately -1 to 1. As with the
other measures, zero indicates that the question did not discriminate
between students in the upper and lower half of the class. A 1 would
indicate that all the subjects in the upper half of the class got the
question right and none of the students in the lower half did.
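As a rough illustration, two of these coefficients can be computed with their textbook formulas. The source does not give the exact formulas the report uses, so this is a sketch under that assumption; the biserial coefficient is omitted because it also requires a normal-curve ordinate.

```python
from math import sqrt
from statistics import mean, pstdev

def phi(high_right, high_wrong, low_right, low_wrong):
    """Phi coefficient for the 2x2 table of half (upper/lower) by result (right/wrong)."""
    a, b, c, d = high_right, high_wrong, low_right, low_wrong
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

def point_biserial(item_correct, total_scores):
    """Correlation between a right/wrong item (1/0) and the total score."""
    right = [t for x, t in zip(item_correct, total_scores) if x]
    wrong = [t for x, t in zip(item_correct, total_scores) if not x]
    sd = pstdev(total_scores)
    if not right or not wrong or sd == 0:
        return 0.0  # undefined when there is no variation; reported here as 0
    p = len(right) / len(item_correct)
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))
```

`phi(10, 0, 0, 10)` returns 1.0, matching the case where everyone in the upper half and no one in the lower half answers correctly.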
All of these measures apply to those questions meant to discriminate
students on mastery of the material covered. A substantial negative
score on any of them indicates that this question was measuring
something very different from most of the rest of the test. (A question
on mechanics embedded in a philosophy test might have this relation.)
Questions with measures near zero failed to discriminate between
students. Questions with "high" measures were the best in separating
those who had mastered the material from those who had not.
The next question is "What is high?" For HL DIF and X SQ, "high" depends on
the number of students in the class. For all the measures, high depends
on how many different topics and different kinds of questions were
included in the test. For our purposes it serves to take the highest
questions from a test and add them (or variations of them) to a question
file to be used along with questions that have been modified based on
this analysis and with new questions on subsequent tests as the
discriminating items. Note that in theory, if all your questions were
perfect discriminators on these measures, half of your class would get a
perfect score and the other half zeroes, which would probably not be a
satisfactory result.
Of the measures on the last page, under Summary Analysis of the Test, we
will consider only two here:
The Correlation of Odd-Even Scores is the correlation between the scores
of two tests, one made up of all odd items, the other made up of all
even items, all items equally weighted. If this score is high (above
.6), it is probable that a shorter test could be constructed that would
be just as discriminating. If this score is low (below .3), it may be
that you need more discriminating items to achieve an accurate measure.
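A minimal sketch of the odd-even calculation, assuming the results are stored as a matrix with one row per student and one 1/0 entry per question (the names `pearson` and `odd_even_correlation` are hypothetical):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def odd_even_correlation(score_matrix):
    """Correlate each student's score on odd items with the score on even items."""
    odd = [sum(row[0::2]) for row in score_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in score_matrix]  # items 2, 4, 6, ...
    return pearson(odd, even)
```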
Kuder-Richardson reliability of homogeneity is a measure of internal
consistency or how well the test measures a single factor. If the score
is low, the measures of discrimination cannot be high. If this score is
very low, it probably means that you are trying to test for too many
different factors for the length of the test.
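The text does not say which Kuder-Richardson formula the report uses. Assuming it is the common KR-20, with equally weighted right/wrong items and the population variance of total scores, the calculation can be sketched as:

```python
from statistics import pvariance

def kr20(score_matrix):
    """Kuder-Richardson formula 20 for a students-by-items matrix of 1/0 results."""
    n = len(score_matrix)      # number of students
    k = len(score_matrix[0])   # number of items
    totals = [sum(row) for row in score_matrix]
    var_total = pvariance(totals)
    if k < 2 or var_total == 0:
        return 0.0  # undefined without item or score variation; reported here as 0
    sum_pq = 0.0
    for q in range(k):
        p = sum(row[q] for row in score_matrix) / n  # proportion right on item q
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)
```

Items that all rise and fall together with the total score push the value toward 1, while items that measure unrelated factors pull it down.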
Conclusion
A final caveat: The
accuracy of any of these numbers for a particular administration of a
test depends on the number of students in the class. For classes of typical
size, a question may receive a measure substantially higher or lower
than it should simply because of the accidental composition of the
class. No question should be rejected without review of the question
itself. If the question seems like it should serve the purpose for which
it was intended but the numbers indicate that it did not, try it again.
If the numbers still come out in the wrong direction, it may mean that
you need to examine the reasons you feel the question should be working
for you.
It is not necessary to keep all the factors mentioned here in mind at
once or to check each of these measures against each question. Each
question should be included to serve some purpose: to discriminate, to
emphasize, to forecast, to obtain a baseline, to reinforce mastery, or
to achieve a level of difficulty. It is not usually practical or even
possible to make a question do all of these. Therefore a question needs
to be checked against the measures that will indicate whether it
achieved its purpose. It is, of course, possible to design a test meant
only to discriminate. Placement tests are often in this class. However,
in a classroom, such a test might not be the best use of a valuable
teaching tool.