Validity & Reliability


The development of the evaluation forms has relied heavily on content validity. In creating the initial forms, the research team visited classes and interviewed faculty and administrators about their teaching strategies and about the information they would find most valuable for teaching improvement and most appropriate for personnel decisions. Student groups were likewise interviewed to determine their informational needs. Rating forms were developed on the basis of these initial interviews. Draft forms were then taken to faculty and students, who were asked in formal interviews to critique each item relative to a specific course they had taught or taken. The final content of the forms was determined from these interviews. Nationally, a large number of empirical studies have established the validity of student ratings of instruction collected on forms similar to those used in the IASystem™.


Inter-rater reliabilities were computed from data from 13,345 UW classes (those with 10 or more students completing forms) surveyed from fall quarter 1995 through spring quarter 1997. Data from identical items were combined across forms, and a single coefficient is reported for each. Sufficient data are not yet available to compute reliabilities for Forms I and J, or for items delivered online. Inter-class reliabilities, which assess how well ratings discriminate among instructors, were computed for common items (see OEA Report 00-02). We concluded that personnel decisions about faculty could be made reliably when ratings were collected from seven or more classes.

General Assessment Items. The average of items 1-4 provides a general assessment of the course and is reported for each instructor along with responses to the individual items. The reliabilities of ratings of items 1-4 were very high, ranging from .84 to .90. The reliability of the four-item average was .81 for Form G and .87 for all other forms.

Form-Specific Items. Each evaluation form contains a block of items specific to that form. For Forms A through H, items 5-15 are directed toward a particular type of course. An item may appear on more than one form, depending on the similarity of the course types. The reliabilities of ratings of items 5-15 tended to be high: 96% of the coefficients were .80 or greater, and 64% were .85 or greater.

Forms A through H contain a common set of items (items 16-22) relating to student perceptions of the course requirements. Coefficients for these items were .80 or higher on all of these forms.

For Form X, items 16-22 are directed at assessing educational outcomes and do not appear on other forms. Reliabilities for these ratings tended to be somewhat lower than those of other items, ranging from .75 to .80.

Academic Demand Items. Items 23-30 are common to all forms and focus on expected grade, intellectual challenge, and required workload. Reliabilities ranged from .80 to .90, with the exception of item 27, for which the reliability was .77.

The table below provides a summary of inter-rater reliability coefficients for an average class size of 20 students. Some items appear on several forms and the corresponding reliability coefficients were computed from combined data. Other items are unique to a particular form. Brackets and parentheses are used to indicate the correspondence of rating form and coefficient.

Inter-rater Reliabilities by Form

Items            Forms                              Reliability Coefficients
1                All forms                          .89
2                A-F, H, X (G), [J]                 .87 (.87) [.84]
3                A-F, H, X (G)                      .90 (.89)
4                A-F, H, X (G)                      .90 (.89)
Average of 1-4   A-F, H, X (G)                      .87 (.81)
5-15             Each form calculated separately    96% were .80 or above;
                                                    64% were .85 or above
16               A-F, H, (G), [X]                   .87 (.83) [.78]
17               A-H, [X]                           .87 [.80]
18               A-H, [X]                           .85 [.76]
19               A-H, [X]                           .87 [.80]
20               A-E, (F), [G], {H}, (X)            .85 (.81) [.80] {.87} (.75)
21               A-H, [X]                           .86 [.79]
22               A-F, H, (G), [X]                   .85 (.85) [.78]
23               All forms                          .80
24               All forms                          .84
25               All forms                          .82
26               All forms                          .83
27               All forms                          .77
28               All forms                          .90
29               All forms                          .86
30               All forms                          .83

Inter-rater reliability coefficients represent the level of agreement among students on the ratings of individual classes relative to mean differences across classes. Values range from 0 to 1, where 0.0 indicates no agreement among students and 1.0 indicates perfect agreement. The reliability coefficients are intraclass correlations, using classes as the unit of analysis, and were computed using the following formula:

r1 = (MSB - MSW) / [MSB + (k - 1)MSW]

where MSB is the mean square between classes, MSW is the mean square within classes, and k is the average class size.
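As a numerical illustration, the single-rater intraclass correlation can be sketched in a few lines of Python. The mean-square values below are hypothetical, chosen only to show the arithmetic; they are not taken from the UW data.

```python
def single_rater_icc(msb: float, msw: float, k: float) -> float:
    """Single-rater intraclass correlation: (MSB - MSW) / (MSB + (k - 1) * MSW).

    msb -- mean square between classes
    msw -- mean square within classes
    k   -- average class size
    """
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical mean squares, for illustration only.
r1 = single_rater_icc(msb=5.0, msw=0.8, k=20)
print(round(r1, 3))  # prints 0.208
```

Note that even when classes differ clearly from one another, the reliability of any one student's rating is modest; the next formula shows why averaging over a class recovers a much higher reliability.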

The above formula yields the reliability for a single rater or student. As the number of students providing ratings increases, the reliability of the mean rating also increases. Coefficients can be computed for varying class sizes using the Spearman-Brown formula:

rk = k r1 / [1 + (k - 1) r1]

where rk is the reliability of the mean of k student ratings and r1 is the reliability of a single rating.
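The Spearman-Brown adjustment can likewise be sketched in Python. The single-rating reliability used here is hypothetical; the point is that a modest single-rating reliability rises sharply when ratings are averaged over a class of 20 students, the class size assumed in the table above.

```python
def spearman_brown(r1: float, k: float) -> float:
    """Reliability of the mean of k ratings, given single-rating reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)

# Hypothetical single-rating reliability of .21, averaged over 20 students.
rk = spearman_brown(r1=0.21, k=20)
print(round(rk, 2))  # prints 0.84
```

Because the tabled coefficients are reported for an average class size of 20, the same function can be run with a different k to estimate reliabilities for smaller or larger classes.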