Office of Educational Assessment

Using Results

IASystem™ captures and summarizes student ratings of instruction to support both pedagogical and programmatic decision-making. The following are recommendations for using course evaluation data to make both types of decisions, along with considerations related to data quality that support those recommendations, based on empirical research conducted by the Office of Educational Assessment.

Guidelines for Decision-Making

Pedagogical Decisions

At the end of every academic term, instructors make changes to their courses and instruction based on student feedback. These are not high-stakes decisions: the items used are specific to the course format, changes are made based on the particular item content, and adjustments to the course are made continuously over time.

Guidelines for pedagogical decisions
  Base judgements on:                           Why?
  • ratings of “formative” items                Item Content
  • ratings of individual items                 Item Reliability
  • ratings of individual classes               Item Reliability
  • average ratings of at least 7 students      Item Reliability

Programmatic Decisions

The decision-making process must be more rigorous when making changes to departmental curricula or when using ratings in merit, promotion, and tenure decisions. To maximize the validity of student ratings results in making decisions about either courses or instructors, it is important that departments have systematic and well-articulated policies and practices, and that rating results be used in conjunction with other types of evaluative information.

Guidelines for programmatic decisions
  Base judgements on:                                            Why?
  • ratings of “summative” items                                 Item Content
  • adjusted rather than raw medians                             Bias Control
  • ratings of combined items                                    Item Reliability
  • ratings of at least 5 classes                                Item Reliability
  • a minimum difference of .3 when comparing average ratings    Item Reliability

Data Quality

Item Content

IASystem™ evaluation forms are composed of three sets of items to support formative and summative decision-making, and to capture additional information to assist instructors in interpreting evaluation results. These purposes are reflected in the structure and content of IASystem™ items.

Bias Control

Analysis of student ratings data reveals that student course ratings are influenced by several factors. The best known of these are the student's reason for taking the class, class size, and expected grade. The amount of bias is reflected in the magnitude of the inter-correlations between each of these factors and the ratings awarded. IASystem™ corrects for observed bias by using regression analyses to 1) examine the pattern of inter-correlations in student ratings data, and 2) compute an “adjusted” rating for each of the four summative items as well as for the combined rating. As an example, the formula for computing the adjusted median for the first summative item is:

AdjMed1 = Median1 – (2.487 + .003292 ER – .143 LS + .337 RG – 3.8829)
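
As an illustration of how the adjustment is applied, the following minimal Python sketch implements the formula above. The function and argument names (er, ls, rg) simply mirror the symbols in the formula; they are not part of any IASystem™ software interface, and the exact covariate definitions are those used in the underlying regression.

    def adjusted_median_item1(raw_median, er, ls, rg):
        # Bias-adjusted median for the first summative item: subtract the
        # regression-predicted offset from the raw class median. Coefficients
        # are taken directly from the formula above; er, ls, and rg stand in
        # for the class-level covariates used in the regression.
        predicted_offset = 2.487 + 0.003292 * er - 0.143 * ls + 0.337 * rg - 3.8829
        return raw_median - predicted_offset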

Additional factors are emerging from ongoing research, including student demographics (sex and race/ethnicity), rating modality (online vs. paper), and class mode (online vs. face-to-face).

Item Reliability

Item reliability plays a key role in the appropriate use of course evaluation results in making both pedagogical and programmatic decisions. The higher the reliability, 1) the more confident we can be that average ratings reflect student opinions about a class rather than random error, and 2) the smaller the difference between ratings required for statistical significance. Two types of reliability estimates are appropriate for IASystem™ items.

Inter-rater reliability reflects the consistency of ratings by individual students (i.e., the student is the “unit of measure”). Reliability estimates range from 0.0 (completely unreliable) to 1.0 (completely reliable). Reliabilities above 0.7 are considered “high,” and we have adopted this value as the minimum for making instructional improvement decisions. Reliability increases with the number of raters, and as shown by the graph below, adequate reliability is achieved when there are at least 7 students in a class. (Based on analysis of two years of data from UW Seattle using Spearman-Brown adjustments.)

[Figure: Inter-RATER reliability for formative (single-class) decision-making]
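
To show how reliability grows with the number of raters, here is a minimal sketch of the Spearman-Brown relationship. The single-rater reliability value used is an assumed placeholder chosen only for illustration; it is not an IASystem™ statistic.

    def spearman_brown(single_rating_reliability, n_raters):
        # Spearman-Brown prophecy: reliability of the average of n_raters
        # independent ratings, given the reliability of a single rating.
        r = single_rating_reliability
        return n_raters * r / (1 + (n_raters - 1) * r)

    # Illustrative only: with an assumed single-rater reliability of 0.25,
    # averaging the ratings of 7 students reaches the 0.7 criterion.
    for n in (1, 3, 5, 7, 10):
        print(n, round(spearman_brown(0.25, n), 2))   # 0.25, 0.5, 0.62, 0.7, 0.77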

Inter-class reliability reflects the consistency of ratings across classes. It is computed similarly to inter-rater reliability, but the data analyzed are average class ratings (i.e., the class is the “unit of measure”). For high-stakes decisions such as those relating to faculty merit, promotion, and tenure, we recommend maximizing reliability by combining the ratings of multiple courses and using the combined median of the four summative items. We also increase the minimum criterion to 0.8. As the following graph shows, adequate reliability is achieved when at least 5 classes are combined.

[Figure: Inter-CLASS reliability for summative (combined classes) decision-making]
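
The same Spearman-Brown relationship applies when the class is the unit of measure. In the short sketch below, the single-class reliability is again an assumed placeholder, used only to show how combining classes pushes reliability past the 0.8 criterion.

    # Illustrative only: assumed single-class reliability of 0.5 (placeholder).
    r_single, n_classes = 0.5, 5
    r_combined = n_classes * r_single / (1 + (n_classes - 1) * r_single)
    print(round(r_combined, 2))   # 0.83 -- above the 0.8 minimum with 5 classes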