Item Analysis
By: Victor V. Wiesner, Ph.D., LPC-S, NCC, CCMHC
Click here to contact Victor and/or see his GoodTherapy.org Profile
In classical test theory, item analysis traditionally rests on two concepts: item difficulty and item discrimination. Item difficulty is the proportion of test takers who respond to a test item in the keyed direction, either by getting the answer correct or by endorsing the trait or characteristic under examination. It is reported as a p-value (ranging from 0 to 1.00) and is calculated by dividing the number of persons who correctly answered the item by the total number of test takers. Higher values mean the item is easier. Although item difficulty levels are called p-values, they should not be confused with the p-values used to report levels of statistical significance.
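The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library routine: the function name `item_difficulty` and the small response matrix are hypothetical, and each response is assumed to be already scored 1 (correct or endorsed) or 0.

```python
def item_difficulty(responses):
    """Return the p-value of each item: the proportion of test takers scoring 1."""
    n_takers = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_takers for i in range(n_items)]


# Hypothetical data: 5 test takers (rows) by 3 items (columns), scored 1 or 0.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

p_values = item_difficulty(responses)
print(p_values)  # [0.8, 0.6, 0.2]
```

In this made-up data set the first item is the easiest (four of five test takers answered in the keyed direction) and the third is the hardest.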
The “correct” answer for psychological assessment instruments measuring constructs would simply be an answer that endorses the construct. For example, on an instrument measuring depression, a reply that positively signifies a depressive symptom would be a correct answer. An item that asked, “Do you find yourself often discouraged?” would have a higher p-value, that is, it would be an easier item (more would “pass” it), than the question, “Do you frequently have thoughts about suicide?”
Test authors calculate item difficulty in order to improve the instrument. Teachers frequently compute item difficulty when considering grade changes. An item may appear easy or hard to the test maker, but once item difficulty is actually calculated after the test is scored and reviewed, the result is sometimes a surprise. An “easy” question may confuse high-performing students because of poor wording, or a “hard” question may prove so easy that those without true knowledge can guess the answer. In the latter case, distractors (also known as foils) of poor quality may be to blame.
An item with a p-value of 0.9 is considered very easy as 90% of test takers correctly answered or endorsed the item. Likewise, an item with a p-value of 0.15 is very difficult as only 15% chose the “keyed” answer or positively endorsed the trait/attitude/characteristic addressed.
In general, items with difficulty levels or p-values of 0.5 will yield the most variation in test score distributions. In choosing the level of item difficulty or setting a range of p-values for an item set, the purpose of the test and the population being tested are among the considerations. To provide maximum differentiation, all items should cluster around 0.5. This would put the test mean at around 50%, which is too low for many practical purposes such as licensing, certification, or university classroom tests. When a test is too difficult for the group, it has an inadequate floor, and when it is too easy, it has an insufficient ceiling.
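The point about variation can be made concrete with the variance of a dichotomously scored item. This is a standard result added here as a brief illustration; the symbols p_i (the item's p-value) and σ_i² (its variance) are notation introduced for this sketch, not taken from the article.

```latex
\sigma_i^2 = p_i\,(1 - p_i), \qquad
\max_{0 \le p_i \le 1} \sigma_i^2 = 0.25 \quad \text{at } p_i = 0.5
```

For example, an item with a p-value of 0.9 has a variance of 0.9 × 0.1 = 0.09, while an item with a p-value of 0.5 has a variance of 0.5 × 0.5 = 0.25, the largest value possible.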
Item discrimination concerns how well an item separates test takers who differ on the trait being measured. Ideally, persons possessing a high degree of the trait would score higher on each item than those who do not possess the trait or possess it only at a low level. If this is the case, then there is positive item discrimination. To use the example above, we may want to discriminate between two groups: those who are depressed and those who are not. One way to measure how well items discriminate is to calculate the discrimination index. Each item will have its own discrimination index.
A discrimination index can be calculated for each item using the extreme groups method. The proportion (0 to 1.00) of correct answers among persons with a total score in the lower range of the entire group is subtracted from the proportion of correct answers among persons scoring in the upper range. Typically, the proportion from the group of persons with a total score in the top 27% is compared with the proportion from those scoring in the bottom 27%. Simply put, for each item, each group (i.e., the highest 27% and the lowest 27%) has its own item difficulty, and the index of discrimination is obtained by subtracting the item difficulty of the low-scoring group from the item difficulty of the high-scoring group. The resulting index of discrimination (D) for each item ranges from -1.00 to +1.00. Note that D is only one type of index of discrimination. This method of determining whether an item differentiates between those who truly have more or less of the trait is an internal method. External methods involve using an external criterion to inspect for differentiation/discrimination.
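The extreme groups calculation can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `discrimination_index`, the `fraction` parameter, and the data are hypothetical, responses are assumed to be scored 1 or 0, and ties at the group boundaries are simply left in their original order.

```python
def discrimination_index(responses, item, fraction=0.27):
    """D = p(upper group) - p(lower group) for one item (extreme groups method)."""
    # Rank test takers by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    # Size of each extreme group: 27% of test takers, at least one person.
    k = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical data: 10 test takers (rows) by 4 items (columns), scored 1 or 0.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

d_first = discrimination_index(responses, item=0)
d_last = discrimination_index(responses, item=3)
print(d_first)  # 1.0
```

With ten test takers, each extreme group holds three people. Every member of the upper group and no member of the lower group passes the first item, so its D is 1.00; the last item is passed by two of three high scorers and one of three low scorers, so its D is about 0.33.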
The correlation of an item with the total test score (internal method) or with an external criterion (external method) is yet another way to investigate the degree of item discrimination. There are a variety of correlational indexes depending on the nature of the variables. The most common correlation is the point biserial (rpb) correlation which is used when the criterion measure (e.g., total score) is continuous and the item scores are dichotomous (e.g., correct-incorrect). The point biserial correlation coefficient ranges from -1.00 to +1.00.
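A point biserial correlation for one item can be sketched with the usual textbook formula, which is the Pearson correlation applied to a 0/1 variable and a continuous one. The function name `point_biserial` and the data are hypothetical. In this sketch the total score includes the item itself, which tends to inflate the correlation somewhat; some analysts remove the item from the total first.

```python
from math import sqrt


def point_biserial(item_scores, totals):
    """Point biserial correlation between a 0/1 item and a continuous score."""
    n = len(item_scores)
    passed = [t for x, t in zip(item_scores, totals) if x == 1]
    failed = [t for x, t in zip(item_scores, totals) if x == 0]
    if not passed or not failed:
        raise ValueError("item has no variance: everyone passed or everyone failed")
    mean_all = sum(totals) / n
    # Population standard deviation of the criterion (total) scores.
    sd = sqrt(sum((t - mean_all) ** 2 for t in totals) / n)
    if sd == 0:
        raise ValueError("criterion scores have no variance")
    p = len(passed) / n  # item difficulty
    mean_passed = sum(passed) / len(passed)
    mean_failed = sum(failed) / len(failed)
    return (mean_passed - mean_failed) / sd * sqrt(p * (1 - p))


# Hypothetical data: four test takers; the two highest scorers passed the item.
item_scores = [1, 1, 0, 0]
totals = [4, 3, 2, 1]
r_pb = point_biserial(item_scores, totals)
print(round(r_pb, 3))  # 0.894
```

Reversing the item scores in this example (so that only the low scorers pass) gives the same magnitude with a negative sign, which corresponds to negative discrimination.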
With both the index of discrimination and correlational indexes, a positive value indicates positive item discrimination, a negative value indicates negative discrimination, and low values indicate low or no discrimination. Items demonstrating high positive discrimination (e.g., over .50) would probably be retained, while those with negative or low absolute values should be rejected, unless the item is checking for mastery and all or nearly all examinees are expected to mark the item correct. Items with a discrimination index between .20 and .50 should probably be modified. Of course, overall item difficulty limits the index of discrimination; that is, very easy items ordinarily have lower discrimination indexes because extreme item difficulty truncates or suppresses the range of scores on one of the variables, thus lowering the resulting correlation or index.
Item discrimination indexes are sometimes used as indexes of item validity. For nearly all tests, the most essential quality for items to have is the power of discrimination. Very easy items do not discriminate well, but one may want to include them in classroom tests for motivational reasons, especially at the beginning of a test. Should nobody answer an item correctly, D would equal zero, as it would if everybody answered it correctly. Items that are extremely difficult may be desired if a critical objective is being measured, even if the discrimination happens to be low. In most cases a test maker wants items to be of moderate difficulty because these show the most positive discrimination.
In classroom tests, it is important to check that the incorrect answers (distractors) are discriminating as well. A well-functioning distractor is chosen more often by low scorers than by high scorers. Occasionally only the correct or keyed answers are examined in item analysis, but this should not be the case. All answer options should be checked to be sure they are contributing to the discrimination ability of the item.
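One simple way to inspect every answer option is to tabulate how often each option is chosen in the upper and lower scoring groups. The sketch below is illustrative only: the function name `distractor_table`, the 27% grouping carried over from the extreme groups method, and the data are assumptions, not a procedure prescribed by the article.

```python
from collections import Counter


def distractor_table(choices, totals, fraction=0.27):
    """For one item, count each option's selections among high and low scorers."""
    # Indices of test takers ordered by total score, highest first.
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    k = max(1, round(len(order) * fraction))
    upper = Counter(choices[i] for i in order[:k])
    lower = Counter(choices[i] for i in order[-k:])
    # Map each chosen option to (count in upper group, count in lower group).
    return {opt: (upper[opt], lower[opt]) for opt in sorted(set(choices))}


# Hypothetical data: the option each of 10 test takers chose on one item
# (keyed answer "B") and each test taker's total test score.
choices = ["B", "B", "B", "B", "A", "B", "C", "A", "C", "A"]
totals = [20, 18, 17, 15, 14, 12, 11, 9, 7, 5]

table = distractor_table(choices, totals)
print(table)  # {'A': (0, 2), 'B': (3, 0), 'C': (0, 1)}
```

In this made-up example the keyed answer "B" is chosen only by high scorers, while distractors "A" and "C" attract only low scorers, which is the desired pattern. A distractor chosen mainly by high scorers, or by nobody at all, would be a candidate for revision.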
The statistics and discussion above are based on classical test theory, but item response theory (IRT) uses many similar concepts. The item characteristic curve (ICC), sometimes called the item trace curve, is used in IRT to analyze items. In the item characteristic curve, the strength of the attribute is represented on the horizontal axis and the likelihood of passing the item is scaled along the vertical axis. Typically, the ICC will take on the shape of an “S”. If one extends a horizontal line from 0.50 on the vertical axis (representing a 50% chance of passing or endorsing an item), it intersects the ICC at a point representing a certain strength of the attribute. In this way the 50% point of each item is matched with a strength point. The further out this point lies on the horizontal axis (i.e., the greater the strength of the attribute), the more difficult the item is. Thus the ICC provides a useful visual representation of item difficulty for each item (see Figure 1).
At this same pivotal point on the graph, the ICC will have a certain slope. The steeper the slope, the more item discrimination is demonstrated. An ICC that looks like an elongated “S” leaning to the right will provide less item discrimination than an ICC with a straighter and steeper “S” shape. If the ICC were very flat, that would indicate almost no item discrimination, and if the ICC had the shape of a backward “S”, then negative item discrimination would be evidenced.
Figure 1. Item characteristic curve.

If one were to graph many items on the same graph, the items shifted to the right would be the more difficult items. Where the ICC intersects the vertical axis the strength of the attribute is either zero or is at least at a minimum. The likelihood of passing the item at this point is equivalent to a “false positive” – meaning that someone with very little or none of the characteristic in question would have this likelihood of “passing” or endorsing the item. Ideally this likelihood for an item would be zero, although if there are five possible answers one might expect the false positive likelihood to be 0.20 because of chance (if the question is a measure of knowledge).
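The shape of an ICC can be sketched with a logistic function. The article does not name a specific IRT model, so the three-parameter logistic form used here is an assumption chosen for illustration, as are the function name `icc` and the parameter names: `a` for discrimination (slope), `b` for difficulty (location on the horizontal axis), and `c` for the lower asymptote, the likelihood of passing with very little of the attribute.

```python
from math import exp


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of passing an item at attribute strength theta.

    a: discrimination (steepness of the curve around theta = b)
    b: difficulty (larger values shift the curve to the right)
    c: lower asymptote (chance of passing with almost none of the attribute)
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


# Hypothetical items: same attribute strength, different difficulty.
easy = icc(0.0, a=1.0, b=-1.0)
hard = icc(0.0, a=1.0, b=1.0)
print(easy > hard)  # True

# With no guessing, the curve passes through 0.5 where theta equals b.
print(icc(1.5, a=2.0, b=1.5))  # 0.5
```

When `c` is zero, the 50% point falls exactly at `theta = b`, matching the description of item difficulty above. With a nonzero `c`, such as 0.20 for a five-option knowledge item, the curve at `theta = b` sits halfway between `c` and 1 instead, so the simple 50% reading applies only to models without a guessing parameter.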
The item analysis concepts discussed above are basic and commonly used in test development, but there are other more advanced item analysis concepts not discussed. A few of the topics not discussed include item bias, qualitative item analysis, factor analysis, computerized adaptive testing (CAT), differential item functioning (DIF), considerations of speed, and various software programs used in item analysis.
Contributed by Van Wiesner, Sam Houston State University, Huntsville, TX
Additional Resources
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Sage Applied Social Research Methods Series, Vol. 26. Thousand Oaks, CA: Sage.
©Copyright 2008 by Victor V. Wiesner. All Rights Reserved. Permission to publish granted to GoodTherapy.org. The following article was solely written and edited by the author named above. The views and opinions expressed are not necessarily shared by GoodTherapy.org. Questions or concerns about the following article can be directed to the author or posted as a comment to this blog entry.