Using IASystem to make decisions

Evaluation of programs and processes can be classified as either formative or summative depending on whether the purpose of the evaluation is to enable ongoing improvement or to provide an end-point judgment of quality.  Different types of data support different types of decisions.

                  Formative Evaluation                           Summative Evaluation
Purpose           to improve process                             to evaluate process
Timing            while process is on-going (interim feedback)   at end of process
Quality of data   may be somewhat informal                       rigor should reflect the significance of decision

An example of formative decision-making relative to a particular class would be mid-quarter changes made by the instructor in response to informal student feedback.  The end-of-quarter Course Summary Report could support summative judgments about how the course as a whole went that quarter (considering especially the average ratings of the global Items 1-4).  However, the line between formative and summative evaluation often blurs.  It does so here, because summative course evaluations often give faculty formative information for changing their courses before offering them again.  Possible uses of other IAS reports for formative and summative evaluation of the professional development of instructors and the evolution of academic programs are suggested below.

  Formative Summative
Class Informal poll, Observations Course Summary Report
Instructor Course Summary Report
High-Low Report
Custom Summary Report
Trend analyses
Program Annual Report
Custom Summary Report
5-Year Summary Report

Whether making formative or summative decisions, it is essential to know if observed differences among mean ratings are statistically significant.  The higher the item reliability, the smaller the difference required for significance.  The way that reliability is computed depends on whether the means under review are those for a single course or for combined courses.

Individual Courses

When making course improvements, an instructor may want to compare median ratings of two items relative to the same course, or ratings of two different courses on the same item.  Item medians would be those reported on the respective Course Summary Reports and the appropriate index of reliability would be inter-rater reliability.  If a median is computed on ratings by a small number of students, it is more likely to be influenced by a single extreme rating than if it were computed on a large number of ratings.  For this reason, inter-rater reliability varies according to the number of students in the class.  The figure below shows the reliabilities of Combined Items 1-4, the average reliability of items requiring ‘absolute’ ratings (items 1-22, 28, 29), and the average reliability of items rated ‘relative’ to other college courses (items 23-27).¹

Figure: Inter-rater reliability for course evaluation items according to number of raters

As shown in the figure, items rated on an absolute scale (including Combined Items 1-4) achieve a high level (.70) of reliability at class sizes of approximately 7-10, whereas items that ask students to rate the class ‘relative’ to other college courses require a class size of approximately 20.²  These data suggest the following interpretive guidelines for formative decisions based on single class ratings:

  • 7-10 students minimum for ‘absolute’ items (1-22, 28, 29)
  • 15 students minimum for ‘relative’ items (23-27)
  • Judgments based on individual items are acceptable because the ratings are used for formative decisions
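
The class-size guidelines above can be written as a simple check.  The Python sketch below is illustrative only and is not part of IAS: the function name and constants are hypothetical, and using 7 as the cutoff for ‘absolute’ items is an assumption that takes the lower bound of the 7-10 range.

```python
# Item groups as listed in the guidelines above.
ABSOLUTE_ITEMS = set(range(1, 23)) | {28, 29}   # items 1-22, 28, 29
RELATIVE_ITEMS = set(range(23, 28))             # items 23-27

MIN_RATERS_ABSOLUTE = 7    # assumption: lower bound of the 7-10 range
MIN_RATERS_RELATIVE = 15   # minimum stated in the guidelines


def meets_formative_minimum(item: int, n_raters: int) -> bool:
    """Return True if a single-class rating of `item` by `n_raters`
    students meets the minimum class size for formative use."""
    if item in ABSOLUTE_ITEMS:
        return n_raters >= MIN_RATERS_ABSOLUTE
    if item in RELATIVE_ITEMS:
        return n_raters >= MIN_RATERS_RELATIVE
    raise ValueError(f"unknown item number: {item}")
```

For example, a class of 8 students would pass this check for item 3 but not for item 25.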

Combined Courses

For administrators making summative decisions about curriculum revision or instructor promotion and tenure, the question is not how reliable student ratings are relative to one particular class, but the extent to which the ratings for an instructor are reliable across several classes.  The appropriate form of reliability for these types of comparisons is inter-class reliability (i.e., the stability of ratings across classes).  Reliability can be maximized by combining items into a single composite.  For this reason, IAS provides the Combined Items 1-4 for which inter-class reliabilities are shown below.³

Figure: Inter-class reliability for items 1 through 4 according to number of classes

As the figure shows, inter-class reliability of Combined Items 1-4 increases as a function of the number of classes.  For this reason, a minimum of five (preferably seven) classes should be combined to make summative judgments.  In addition, small differences should be considered with caution.  Based on the overall distribution of combined scores from Items 1-4, the minimum difference that is statistically meaningful is 0.3 points.  Historical data suggest the following interpretive guidelines for summative decisions based on combined class ratings:

  • Base decisions on Combined Items 1-4
  • 5-7 classes minimum
  • 7-10 students per class minimum
  • Differences between combined ratings of less than 0.3 points are not statistically meaningful
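
As a rough illustration, the summative guidelines above could be applied as follows.  This Python sketch rests on several assumptions that do not come from IAS: the names are hypothetical, classes below the student minimum are simply excluded, the per-class Combined Items 1-4 ratings are averaged without weighting, and 7 students is taken as the lower bound of the 7-10 range.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MIN_CLASSES = 5                  # guideline: 5-7 classes minimum
MIN_STUDENTS_PER_CLASS = 7       # assumption: lower bound of the 7-10 range
MIN_MEANINGFUL_DIFFERENCE = 0.3  # guideline: smaller differences are not meaningful


@dataclass
class ClassRating:
    combined_1_4: float  # Combined Items 1-4 rating for one class
    n_students: int      # number of students who rated the class


def combined_rating(classes: Sequence[ClassRating]) -> Optional[float]:
    """Unweighted average over classes that meet the student minimum,
    or None when fewer than MIN_CLASSES classes qualify."""
    eligible = [c for c in classes if c.n_students >= MIN_STUDENTS_PER_CLASS]
    if len(eligible) < MIN_CLASSES:
        return None
    return sum(c.combined_1_4 for c in eligible) / len(eligible)


def is_meaningful_difference(a: float, b: float) -> bool:
    """True when two combined ratings differ by at least 0.3 points.
    Rounding to two decimals guards against floating-point noise."""
    return round(abs(a - b), 2) >= MIN_MEANINGFUL_DIFFERENCE
```

Returning None when too few classes qualify, rather than a number, keeps an under-supported average from being used in a comparison by mistake.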

Because of the importance of summative judgments to faculty careers, it is essential that student ratings of instruction be evaluated thoughtfully as part of a systematic departmental assessment program. 


1 Lowell, N. and Gillmore, G.M.  “Reliability of the items of the Instructional Assessment System:  Forms A-G.” OEA Report 91-01.

2 Reliability coefficients range from 0.0 to 1.0, with higher numbers indicating more agreement between raters and lower coefficients indicating less agreement.  As a general rule of thumb, low, medium and high reliability are referenced by coefficients of .00-.40, .40-.70, and .70-1.00, respectively.

3 Gillmore, G.M.  (2000)  “Drawing inferences about instructors: The inter-class reliability of student ratings of instruction.” OEA Report 00-02.
McGhee, D.E.  (2002)  “Drawing inferences about instructors: Constructing confidence intervals for student ratings of instruction.” OEA Report 02-05.