Using assessment data

The most common outputs generated by assessment tools are response rates and achievement mastery levels. Data elicited from a multiple-choice instrument contain many pieces of information that can help the instructor identify strengths and weaknesses of both the students and the instrument itself. The figure below shows a sample of the data that can be obtained from scanned multiple-choice assessments. This particular format is produced by the Scantron evaluation system supported by ParSCORE.

  • Mean, median, and mode scores as well as standard deviations are compiled from the 15 participants in this grouping. These scores provide the instructor with comparative results of the group as a whole.
  • The measure of skewness describes how symmetric a particular data set is. For a normal distribution, the measure of skewness is at or near zero. Negative values indicate data that are skewed left, and positive values indicate data that are skewed right.
  • The degree of kurtosis indicates whether the data set is peaked or flat compared to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean and then decline rapidly. Data sets with low kurtosis tend to have a flat plateau near the mean instead of a sharp peak.
  • A reliability coefficient (KR20) indicates how consistently the test measures student performance. Reliability is often described as the correlation between two sets of scores, such as two administrations of the same test. KR20 (Kuder-Richardson Formula 20) instead estimates reliability from a single administration, based on how consistently students respond across the items. A coefficient of 0 indicates no consistency in the scores. A coefficient of 1 indicates perfectly consistent scores. The approximate range of reliability coefficients can be described as follows:
    • .90 or higher – High reliability.
      Suitable for making a decision about an examinee based on a single test score.
    • .80 to .89 – Good reliability.
      Suitable for use in evaluating individual examinees if averaged with a small number of other scores of similar reliability.
    • .60 to .79 – Low to moderate reliability.
      Suitable for evaluating individuals only if averaged with several other scores of similar reliability.
    • .40 to .59 – Doubtful reliability.
      Should be used only with caution in the evaluation of individual examinees. May be satisfactory for determination of average score differences between groups.
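As a rough illustration of the skewness and kurtosis measures described in the list above, the following Python sketch computes both from a list of raw scores. It assumes the population (moment-based) formulas and reports excess kurtosis, where a normal distribution scores 0. The formulas ParSCORE actually uses are not stated here, and the function names and sample scores are hypothetical.

```python
from statistics import mean, pstdev


def _z_scores(scores):
    """Standardize scores using the population standard deviation."""
    m = mean(scores)
    s = pstdev(scores)
    if s == 0:
        raise ValueError("all scores are identical; shape is undefined")
    return [(x - m) / s for x in scores]


def skewness(scores):
    """Mean cubed z-score: near 0 if symmetric, < 0 skewed left, > 0 skewed right."""
    z = _z_scores(scores)
    return sum(v ** 3 for v in z) / len(z)


def excess_kurtosis(scores):
    """Mean fourth-power z-score minus 3, so a normal distribution is near 0."""
    z = _z_scores(scores)
    return sum(v ** 4 for v in z) / len(z) - 3.0


# Hypothetical score lists, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(abs(skewness(symmetric)) < 1e-9)         # symmetric data: skewness near zero
print(skewness(right_skewed) > 0)              # long right tail: positive skewness
print(round(excess_kurtosis(symmetric), 2))    # evenly spread data: about -1.3
```

A negative excess kurtosis, like the -1.39 in the sample report, points to a flatter distribution than the normal curve.
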
Instructor: Wyle E. Coyote
Class: Intro. to Epidemiology
Time/day: 3/20/10
Test: Exam 1
Form: A
Total Possible Points: 63
Students in this Group: 15
Mean score: 49.27
Median Score: 49.00
Mode Score: 55.00
Variance: 16.73
Standard Deviation: 4.09
Measure of Skewness: 0.15
Degree of Kurtosis: -1.39
Highest Score: 55
Lowest Score: 43
Reliability Coefficient (KR20): 0.59
Standard Error of Measurement: 2.62
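The reliability coefficient and standard error of measurement in the report above can be sketched in Python as follows. This assumes the textbook KR20 formula with the population variance of total scores, and the common relationship SEM = SD × sqrt(1 − reliability). Whether ParSCORE uses exactly these variants is an assumption. The function names and the small response matrix are hypothetical.

```python
import math
from statistics import pvariance


def kr20(responses):
    """KR20 for a list of students, each a list of 0/1 item scores."""
    n = len(responses)          # number of students
    k = len(responses[0])       # number of items
    totals = [sum(student) for student in responses]
    var_total = pvariance(totals)
    # Sum of p * q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for i in range(k):
        p = sum(student[i] for student in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


def standard_error_of_measurement(sd, reliability):
    """SEM from the score standard deviation and a reliability coefficient."""
    return sd * math.sqrt(1 - reliability)


# Hypothetical 4-student, 3-item response matrix (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(kr20(responses), 2))  # 0.75

# Values taken from the sample report: SD = 4.09, KR20 = 0.59.
print(round(standard_error_of_measurement(4.09, 0.59), 2))  # 2.62
```

The second result reproduces the Standard Error of Measurement of 2.62 shown in the report, which suggests the report uses this relationship.
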

The report also provides response statistics for each individual item, which indicate item quality. Once the entire test is scored, pay particular attention to the Correct Responses as a Percentage of and Distractor Analysis fields.

The Correct Responses as a Percentage of section details the following items:

  • Item#: Question number within the test.
  • Total Group: The percentage of the total students who responded correctly to the item.
  • Upper 27% of Group: Of the upper 27% of the students in the class, the percentage of students who responded correctly to the item.
  • Lower 27% of Group: Of the lower 27% of the students in the class, the percentage of students who responded correctly to the item.

                   Correct Responses as a Percentage of                 Discrimination  Distractor Analysis
Item#              Total Group  Upper 27% of Group  Lower 27% of Group                  Alt.  % Correct  Biser.  Pt-Biser.
1                  100          100                 100                 0.00            A     0%         0.00    0.00
                                                                                        *B    100.00     0.00    0.00
                                                                                        C     0%         0.00    0.00
                                                                                        D     0%         0.00    0.00
                                                                                        E     0%         0.00    0.00
2                  20           25                  0                   0.33            A     60%        -0.10   -0.08
                                                                                        *B    20%        0.48    0.33
                                                                                        C     13%        -0.65   -0.41
                                                                                        D     7%         0.34    0.18
                                                                                        E     0.00       0.00    0.00
3                  40           50                  50                  -0.02           A     13%        0.57    0.36
Review this Item!                                                                       B     0%         0.00    0.00
                                                                                        C     47%        -0.28   -0.22
                                                                                        *D    40%        -0.03   -0.02
                                                                                        E     0.00       0.00    0.00
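The three percentages in the Correct Responses as a Percentage of columns can be approximated with a short Python sketch. It assumes students are ranked by total score and that the group size is 27% of the class rounded to the nearest whole student. With 15 students that gives 4 per group, which would be consistent with the 25% and 50% values in the table above. How ParSCORE rounds and breaks ties is not stated, and the function name and data are hypothetical.

```python
def percent_correct_by_group(totals, item_correct, fraction=0.27):
    """Return (total, upper, lower) percent correct for one item.

    totals       -- total test score for each student
    item_correct -- 1 if that student answered the item correctly, else 0
    """
    n = len(totals)
    group_size = max(1, round(n * fraction))
    # Student indices ordered from highest to lowest total score.
    ranked = sorted(range(n), key=lambda s: totals[s], reverse=True)
    upper = ranked[:group_size]
    lower = ranked[-group_size:]

    def pct(indices):
        return 100.0 * sum(item_correct[s] for s in indices) / len(indices)

    return pct(range(n)), pct(upper), pct(lower)


# Hypothetical class of 15 students with distinct total scores.
totals = [55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41]
# Three students answered correctly: one in the top 4, none in the bottom 4.
item = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(percent_correct_by_group(totals, item))  # (20.0, 25.0, 0.0)
```

This hypothetical item reproduces the pattern shown for item 2: 20% of the total group, 25% of the upper group, and 0% of the lower group.
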

Discrimination measures the effectiveness of a question: how well it distinguishes between students who have mastered the material and those who have not. The discrimination index is also used to rate question effectiveness as low, medium, or high. The index ranges are shown below:

      • < 0.00 (negative): Unacceptable – check item for error
      • 0.00 – 0.24: Room for improvement
      • 0.25 – 0.39: Good item
      • 0.40 – 1.00: Excellent item
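The ranges above translate directly into a small lookup. The Python sketch below is one possible encoding. It treats the boundaries as half-open intervals, so a value such as 0.245 falls into "Room for improvement"; that choice is an assumption, since the listed ranges leave small gaps. The function name is hypothetical.

```python
def classify_discrimination(index):
    """Map a discrimination index to the rating scale listed above."""
    if index < 0.0:
        return "Unacceptable - check item for error"
    if index < 0.25:
        return "Room for improvement"
    if index < 0.40:
        return "Good item"
    return "Excellent item"


# Discrimination values for items 1-3 in the sample report.
for item_number, index in [(1, 0.00), (2, 0.33), (3, -0.02)]:
    print(item_number, classify_discrimination(index))
```

Item 3 falls into the unacceptable range, which agrees with the "Review this Item!" flag in the report.
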

Within the Distractor Analysis section, each answer alternative (Alt.) is listed, with the correct answer marked by an asterisk, together with the percentage of the class who selected that alternative (% Correct) and its biserial (Biser.) and point-biserial (Pt-Biser.) coefficients. The point-biserial coefficient is the correlation between the score on an item and the total score on the test. In essence, it indicates how well an item predicts student performance on the entire exam by comparing how well students did on one question with how well they did on all the questions. The coefficient ranges from -1 to +1. The scale below reflects the ranges of Pt-Biser. scores:

Scale Range     Indication
.30 or above    very good test distractor
.20 to .29      reasonably good test distractor
.09 to .19      needs improvement
below .09       poor test distractor
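A point-biserial coefficient of the kind described above can be sketched in Python as the correlation between a 0/1 indicator (whether each student chose a given alternative) and the students' total scores. The sketch assumes the standard formula with the population standard deviation and returns 0.0 when everyone or no one chose the alternative, which mirrors the 0.00 entries in the sample report but is an assumption about ParSCORE's behavior. The function names and data are hypothetical.

```python
import math
from statistics import mean, pstdev


def point_biserial(chose, totals):
    """Correlation between a 0/1 choice indicator and total test scores."""
    n = len(totals)
    p = sum(chose) / n                  # proportion who chose the alternative
    sd = pstdev(totals)
    if p == 0.0 or p == 1.0 or sd == 0:
        return 0.0                      # no variation, so no correlation to report
    chosen = [t for t, c in zip(totals, chose) if c]
    not_chosen = [t for t, c in zip(totals, chose) if not c]
    return (mean(chosen) - mean(not_chosen)) / sd * math.sqrt(p * (1 - p))


def interpret_point_biserial(r):
    """Map a point-biserial value to the scale listed above."""
    if r >= 0.30:
        return "very good"
    if r >= 0.20:
        return "reasonably good"
    if r >= 0.09:
        return "needs improvement"
    return "poor"


# Hypothetical data: the two highest scorers chose this alternative.
totals = [1, 2, 3, 4]
chose = [0, 0, 1, 1]
r = point_biserial(chose, totals)
print(round(r, 2), interpret_point_biserial(r))  # 0.89 very good
```

For a correct answer, a high positive value means stronger students tended to choose it. For an incorrect alternative, a negative value is the desirable pattern, since weaker students should be the ones drawn to it.
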

A low positive or negative Pt-Biser. value can be used to identify problems such as:

  • A questionable correct answer
  • More than one correct answer
  • No real correct answer
  • An ambiguous or confusing question stem

For more information on how to read the Scantron ParSCORE item analysis report, view the Understanding Statistical Information on Item Analysis Reports tutorial.


Bontempo, B. (2009). MMLog: The Point-Biserial Correlation Coefficient. Retrieved October 15, 2010 at

Frary, R.B. (2010). A Simulation Study of Reliability and Validity of Multiple-Choice Test Scores Under Six Response-Scoring Modes. Journal of Educational and Behavioral Statistics, 7(4), 333-351.

Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4(10).