Course Evaluations
In response to community concerns regarding bias in course evaluations, OEA engaged in a large-scale, quantitative self-study to investigate IASystem data for evidence of systemic bias by instructor characteristics. Using multilevel modeling, we reviewed ratings from over 52,000 UW classes taught by over 4,000 instructors from Fall 2017 to Summer 2021. We found that models built with course and instructor characteristics are poor predictors of course evaluation measures, explaining less than 30% of the variance of unadjusted median ratings between instructors. These findings suggest that other factors related to the course experience are more predictive of course ratings. This study also confirmed findings from previous OEA reports listed below on the persistent impact of evaluation response rate, expected grade, class size, and the percent required to take the course on ratings. Results on instructor characteristics were mixed; we found some evidence that course ratings differ by instructor race and gender in statistically significant ways, with specific and consistent differences among course ratings for non-white instructors. However, instructor characteristics explained much less of the variance in ratings between instructors than we expected to find given results from previous literature and the prevailing discourse regarding student evaluations of instruction.
A slide deck summarizing the results can be found here.
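As a rough illustration of what "variance explained between instructors" can mean in a multilevel setting, the sketch below computes the proportional reduction in between-instructor variance when predictors are added to an intercept-only model. This is a common summary for multilevel models, not necessarily the exact statistic used in the study, and the function name and variance components are hypothetical.

```python
def between_variance_explained(tau_null: float, tau_model: float) -> float:
    """Proportional reduction in between-instructor variance.

    tau_null:  between-instructor variance from an intercept-only model
    tau_model: between-instructor variance after adding predictors
    """
    if tau_null <= 0:
        raise ValueError("tau_null must be positive")
    return (tau_null - tau_model) / tau_null


# Hypothetical variance components for illustration only;
# these are NOT values reported by the study.
tau_null = 0.20
tau_model = 0.15

share = between_variance_explained(tau_null, tau_model)
print(f"{share:.0%} of between-instructor variance explained")
```

A value under 0.30 from such a calculation would be in line with the "less than 30%" figure quoted above, meaning most differences between instructors remain unaccounted for by the modeled characteristics.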
This study continued our investigation of factors influencing student ratings of instruction. Results underscored the importance of maintaining a high rate of student response to reduce response bias. Earlier we reported that classes evaluated using paper forms received slightly (but statistically significantly) higher ratings than those evaluated online. The present study confirmed those findings but noted that the observed difference between online and paper ratings was minimal and could be attributed to differences in response rate. The difference in ratings due to evaluation mode was not large enough to be visible on IASystem™ reports, or to support statistical correction for bias. Student response rate had a much larger impact on average rating; the difference between the lowest and highest response rate quintiles was M = .37, compared to a difference of M = .05 between paper and online ratings. We emphasize the importance of the implementation of deliberate practices on the part of both faculty and departments to maximize student response rates for all course evaluations, particularly those conducted online.
We examined the relationships between course evaluation ratings and: a) evaluation mode (via paper vs. the internet) and b) course delivery mode (face-to-face, internet, or hybrid) for autumn quarter 2014. Ratings given online tended to be lower than those given on traditional paper forms. Likewise, course sections that were delivered solely via the internet were rated lower than both face-to-face and hybrid sections. However, this result did not hold when analyses were restricted to only those courses which offered both face-to-face and online sections. Taken as a whole, the findings suggest that lower average ratings obtained for internet ratings and among online sections are likely the result of response biases rather than true differences in instructional quality.
In response to expressed concerns regarding the degree of academic challenge posed by UW courses, OEA undertook to develop a single index of challenge and student engagement based on items from Instructional Assessment System (IAS) course evaluation forms. Although a set of items specifically directed at this topic had been added to IAS forms in 1998, we felt that a single index might provide a simpler and more powerful representation for individual courses. Also, because the IAS is used to evaluate a large percentage of courses taught at the UW, the index could provide useful insight to more general student perception of UW educational experiences.
There is strong interest at the University of Washington in providing a positive environment for all faculty, staff, and students. This report describes development of a questionnaire assessing classroom climate that can be administered in conjunction with quarterly evaluation of university courses.
This study compared mean ratings, inter-rater reliabilities, and the factor structure of items for online and paper student-rating forms from the University of Washington’s Instructional Assessment System.
This report expands upon an earlier discussion of instructor-level reliability of course ratings. Gillmore (2000) previously demonstrated that adequate instructor-level reliability may be obtained when ratings are aggregated across at least seven classes. What was left unexamined, however, was the precision with which one should regard mean ratings. This brief report presents confidence intervals for true scores based on Instructional Assessment System (IAS) data from approximately 4,000 instructors.
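To make the idea of a confidence interval for a true score concrete, here is a minimal sketch using the classical standard error of measurement, SEM = SD × sqrt(1 − reliability). The report's own procedure is not described in this summary, so this is a textbook approximation rather than its method; the function name and all input values are hypothetical.

```python
import math


def true_score_interval(observed_mean: float, sd: float,
                        reliability: float, z: float = 1.96):
    """Approximate confidence interval around an observed mean rating.

    Uses the classical standard error of measurement:
        SEM = sd * sqrt(1 - reliability)
    z = 1.96 corresponds to a roughly 95% interval.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    sem = sd * math.sqrt(1.0 - reliability)
    return observed_mean - z * sem, observed_mean + z * sem


# Hypothetical inputs for illustration only (not IAS data).
low, high = true_score_interval(observed_mean=4.2, sd=0.5, reliability=0.75)
print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
```

The interval narrows as reliability rises, which is why aggregating ratings over more classes yields more precise instructor-level means.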
This report is based on a presentation by Dr. Gerald Gillmore, Director of the UW Office of Educational Assessment, at the Second Campus-wide Forum on Student Expectations and Demands, which took place on April 26, 2001. The purpose of these brief remarks was to present what students tell us about demands and expectations via their evaluations of classes using the Office of Educational Assessment Instructional Assessment System (IAS). The following four facts are discussed:
- Students put more effort into classes that demand more effort for them to be successful.
- Students tend to prefer more challenging classes over less challenging classes.
- The widely held belief that assigning students more work will lead to lower student ratings is not true in and of itself.
- It is clear that not all faculty are equally demanding. In fact, there are considerable differences among faculty in the amount of time students devote to their courses.
The question addressed in this report is whether there is sufficient consistency in student ratings of instructors to support the use of data aggregated over classes for personnel decisions. Instructional Assessment System (IAS) data from over 2,800 instructors teaching over 23,000 classes were analyzed. Results showed adequate instructor-level reliability of ratings when aggregating across about seven classes and especially strong instructor-level reliability when aggregating across 15 or more classes. However, these results assume certain conditions of decision-making and are limited to similar conditions of measurement.
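The way reliability grows as ratings are aggregated over more classes can be sketched with the Spearman-Brown formula. This is a standard stand-in for illustration, not the analysis used in the report, and the single-class reliability below is a hypothetical value chosen only to show the shape of the curve.

```python
def spearman_brown(single_class_reliability: float, n_classes: int) -> float:
    """Projected reliability of a mean rating over n_classes classes.

    Spearman-Brown: r_k = k * r_1 / (1 + (k - 1) * r_1)
    """
    if not 0.0 <= single_class_reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if n_classes < 1:
        raise ValueError("n_classes must be at least 1")
    r1 = single_class_reliability
    return n_classes * r1 / (1 + (n_classes - 1) * r1)


# Hypothetical single-class reliability; NOT a figure from the report.
r1 = 0.30
for k in (1, 7, 15):
    print(f"{k:>2} classes: projected reliability {spearman_brown(r1, k):.2f}")
```

Under this assumed starting value, the projected reliability climbs quickly over the first several classes and more slowly afterward, which matches the general pattern of "adequate at about seven, strong at 15 or more" described above.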
Using statistical adjustment to reduce biases in student ratings. (Response to articles by J.S. Armstrong et al. in American Psychologist, vols. 52-53, 1997-98.) G.M. Gillmore and A.G. Greenwald. American Psychologist, July 1999, Vol.54(7), p.518(2).
Arguments concerning student ratings range from endorsing ratings as largely valid and broadly useful to characterizing them as invalid and harmful to instruction. The authors support an intermediate position recognizing some validity of student ratings, acknowledging their useful role in giving students voice in the evaluation of instruction, and stressing the possibility of improving the validity of ratings by statistically removing identifiable biases.
No Pain, No Gain? The Importance of Measuring Course Workload in Student Ratings of Instruction. A.G. Greenwald and G.M. Gillmore. G.M. Pressley, (editor), Journal of Educational Psychology, 1997, Vol.89(4), pp.743-751.
A covariance structure model assessed the effect of expected grades and course workloads on evaluative ratings. The model was developed and tested over three academic terms using data from 200 undergraduate courses, and demonstrated that (a) courses that gave higher grades were better liked (a positive path from expected grades to evaluative ratings), and (b) courses that gave higher grades had lighter workloads (a negative relation between expected grades and workload). These findings support the conclusion that instructors’ grading leniency influences ratings.
Grading Leniency Is a Removable Contaminant of Student Ratings. A.G. Greenwald and G.M. Gillmore. R.D. Fowler (editor), American Psychologist, 1997, Vol.52(11), pp.1209-1217.
This study identified four data patterns that together discriminate among five theories of the grades-ratings correlation. The presence of all four of these markers in student ratings data from the University of Washington suggests that the grades-ratings correlation is due to an unwanted influence of instructors’ grading leniency on ratings. The effects of this leniency can be removed by means of a statistical correction of student ratings data.
The purpose of this study was to better understand the effects of grades and measures of course difficulty on student ratings of instruction. It was based on ratings from 337 UW classes in fall 1993, using the newly developed Form X. The study found that students’ ratings are positively influenced by three factors, in order of importance: student perceptions of the ratio of valuable hours to total hours in the time put into the course, the challenge of the course, and the leniency of grading.
The University of Washington (UW) was among the earliest institutions to systematically evaluate courses using student ratings. The first efforts at the UW were initiated in the 1920s, and over the years different methods of collecting and reporting ratings have been used. The Instructional Assessment System (IAS) was introduced in 1974 and has grown considerably in use since that time. This report presents item means and reliability estimates for IAS evaluation items based on data gathered at the UW main campus during the 1989-90 academic year. These data represent more than 150,000 rating forms, evaluating nearly 7,000 classes.
The Generalizability of Student Ratings of Instruction: Estimation of the Teacher and Course Components. G.M. Gillmore, M.T. Kane, and R.W. Naccarato, Journal of Educational Measurement, 1978, Vol.15(1), pp.1-13.
The Generalizability of Class Means. M.T. Kane and R.L. Brennan. Review of Educational Research, 1977, 47, 267-292.
Student Evaluations of Teaching: The Generalizability of Class Means. M.T. Kane, G.M. Gillmore, and T.J. Crooks. Journal of Educational Measurement, 1976, 13, 171-184.
The Generalizability of Student Instructional Ratings: General Theory and Application to the University of Washington Instructional Assessment System. G.M. Gillmore, M.T. Kane, and R.W. Naccarato, EAC Reports. Seattle, Washington: Educational Assessment Center, University of Washington, 1976.
Generalizability and the Interpretation of Student Evaluations of Teaching. M.T. Kane, T.J. Crooks, and G.M. Gillmore. Paper presented at annual meeting of the American Educational Research Association, San Francisco, March 1976.
A Brief Description of the University of Washington Instructional Assessment System. G.M. Gillmore, EAC Reports. Seattle, Washington: Educational Assessment Center, University of Washington, 1974.