Effects of Instructor Gender and Ethnicity on FCQ Ratings Given by Students
Perry Sailor, October 2002

Summary of major findings
The study was designed to look for possible effects of instructor gender and ethnicity on FCQ ratings (student ratings of courses and instructors - see an online version of the instrument), after statistically controlling for the effects of class level (graduate vs. undergraduate), size, and department. We examined tenured and tenure-track (TTT) and non-TTT instructors separately, and excluded TAs. The major findings of the study are:

  1. Gender differences were inconsistent and for the most part exceedingly small; all reported results below are therefore combined across genders.
  2. The only large effect of instructor ethnicity on ratings is limited to non-TTT Asians (a category which may include both Asian-Americans and non-citizens native to Asia), who are rated much lower than whites.
  3. When the population studied is restricted to instructors who taught at least three sections in the three-year period studied, that effect is largely eliminated and little or no effect of ethnicity on ratings can be detected. To the extent that there is an effect for this limited group, it involves Asians, both TTT and non-TTT (rated .36 and .31 standard deviations, respectively, lower than whites).

We looked at FCQ ratings for all fall/spring terms from three academic years - 1999-00, 2000-01, and 2001-02. Each observation in the data file (N=14,677) represented an instructor/section combination, with a mean rating from students in the section on FCQ items 11 (global instructor rating) and 12 (global course rating). The ratings were restricted to those for

  • lectures, recitations, labs, and seminars
  • instructor groups A (tenured and tenure-track, or TTT) and B (other primary instructors, not TTT). Group C (teaching assistants) was excluded.

We statistically removed from each rating the effects of class size, level (undergrad vs. grad), and individual department. This was done by entering these variables as predictors into a SAS regression modeling procedure called GLM (general linear models) and obtaining a predicted rating based on them. This predicted rating was then subtracted from the actual rating, yielding a residual. After each section mean rating was converted to a residual in this fashion, a single mean residual for each instructor was calculated by averaging the instructor's residuals across all sections taught.
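The residual procedure can be sketched in a few lines of Python. This is an illustration only, not the SAS code used for the study: the six sections, the instructor labels, and the department codes are invented, and the main-effects-only least-squares model with dummy-coded level and department is an assumption about how the GLM was specified.

```python
from collections import defaultdict
from statistics import mean

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def design_row(section, departments):
    # Intercept, class size, graduate indicator, and one dummy per
    # department (the first department is the reference category).
    return ([1.0, float(section["size"]), 1.0 if section["level"] == "grad" else 0.0]
            + [1.0 if section["dept"] == d else 0.0 for d in departments[1:]])

def residualize(sections):
    """Return (section residuals, mean residual per instructor)."""
    departments = sorted({s["dept"] for s in sections})
    X = [design_row(s, departments) for s in sections]
    y = [s["rating"] for s in sections]
    k = len(X[0])
    # Ordinary least squares via the normal equations (X'X) beta = X'y.
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    section_resid = []
    by_instructor = defaultdict(list)
    for s, row, yi in zip(sections, X, y):
        predicted = sum(b * x for b, x in zip(beta, row))
        section_resid.append(yi - predicted)       # actual minus predicted
        by_instructor[s["instructor"]].append(yi - predicted)
    return section_resid, {name: mean(r) for name, r in by_instructor.items()}

# Invented data: one row per instructor/section combination.
sections = [
    {"instructor": "A", "dept": "HIST", "level": "undergrad", "size": 120, "rating": 3.1},
    {"instructor": "A", "dept": "HIST", "level": "grad", "size": 12, "rating": 3.6},
    {"instructor": "B", "dept": "HIST", "level": "undergrad", "size": 45, "rating": 2.8},
    {"instructor": "C", "dept": "PHYS", "level": "undergrad", "size": 200, "rating": 2.5},
    {"instructor": "C", "dept": "PHYS", "level": "grad", "size": 15, "rating": 3.4},
    {"instructor": "D", "dept": "PHYS", "level": "undergrad", "size": 60, "rating": 3.0},
]
section_resid, per_instructor = residualize(sections)
```

Because the model includes an intercept, the section-level residuals sum to zero; the per-instructor values in `per_instructor` are the unit of analysis used in the rest of this report.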

The study included ratings on FCQ item 12, course ratings, as well as item 11, instructor ratings. However, because course ratings and instructor ratings were so highly correlated (r=.91), the remainder of this report will discuss only instructor ratings. All statements and differences reported below that apply to one apply to the other also.

All results below are stated in terms of residual scores. The mean of all the residual scores across the entire population is zero, by definition (see technical note below). This means that a residual of, for example, 0.1 can be interpreted as "0.1 points above average, after adjusting for class size, level, and department." The magnitude of a mean residual, as with any mean score, can be evaluated by comparing it to the standard deviation.

The table below summarizes the results from the study:

Ethnic Group Residual Ratings - All Instructors
                      TTT                Non-TTT
Ethnic Group      N   Mean    SD        N   Mean    SD
African-Am.      27    .01   .35       21   -.04   .40
Asian            74   -.09   .35       47   -.33   .76
Hispanic         55   -.06   .40       57   -.12   .66
Native Am.        6   -.02   .36        5    .17   .15
Unknown          51   -.07   .43      143   -.16   .55
White           944    .01   .38    1,118   -.04   .51
All           1,157    .00   .38    1,391   -.07   .53

Among TTT instructors, there was little difference between whites and the other ethnic groups, the largest being the .10 lower rating for Asians, a difference of about .26 standard deviation units (.10/.38 = .26), which is fairly small. However, among non-TTT instructors, the difference between ratings of Asians and whites was considerably larger - the mean rating for Asians was .29 below that for whites, a difference of over half a standard deviation. Non-TTT Asian instructors were scattered across 24 different departments, with only two departments having more than three; furthermore, these two departments - East Asian Languages and Literature, and Economics, with 6 non-TTT Asian instructors each - did NOT contribute much to the extremely low overall mean, since their non-TTT Asian instructors' mean ratings were .02 and -.12, respectively.

A separate analysis was done after eliminating from the population instructors who taught only 1 or 2 sections across the 6 terms. Restricting the analysis to instructors who taught at least three sections sharply attenuated the large negative difference between Asian non-TTT instructors and other groups, and also resulted in a large drop in standard deviation for that group, indicating that most of the negative effect seen in the above table was due to a few instructors who taught only one or two sections each and received extremely low ratings. Perhaps the low ratings they received are the reason they taught only one or two sections - their departments realized they were ineffective instructors and gave them no more teaching assignments. This is just speculation, however.

Ethnic Group Residual Ratings - Minimum 3 Sections Taught
                      TTT                Non-TTT
Ethnic Group      N   Mean    SD        N   Mean    SD
African-Am.      25   -.01   .36        9    .05   .32
Asian            67   -.11   .36       21   -.13   .40
Hispanic         47   -.04   .39       38   -.07   .59
Native Am.        4    .06   .35        3    .07   .07
Unknown          30   -.11   .36       53   -.08   .42
White           827    .02   .36      646    .00   .42
All           1,000    .00   .37      770   -.01   .43
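The standardized Asian-white differences quoted in this report can be checked against the two tables with a short Python sketch. The report does not say which standard deviation was used as the divisor; dividing by the white group's SD is an assumption made here, chosen because it reproduces the .26, .36, and .31 figures and is consistent with "over half a standard deviation" for non-TTT instructors in the full population.

```python
# (N, mean residual, SD) copied from the two tables in this report.
ALL_INSTRUCTORS = {
    "TTT":     {"Asian": (74, -0.09, 0.35), "White": (944, 0.01, 0.38)},
    "Non-TTT": {"Asian": (47, -0.33, 0.76), "White": (1118, -0.04, 0.51)},
}
MIN_3_SECTIONS = {
    "TTT":     {"Asian": (67, -0.11, 0.36), "White": (827, 0.02, 0.36)},
    "Non-TTT": {"Asian": (21, -0.13, 0.40), "White": (646, 0.00, 0.42)},
}

def gap_in_sd_units(rows, group="Asian", reference="White"):
    """Mean difference (group minus reference) divided by the reference SD.

    Using the reference group's SD as the divisor is an assumption;
    the report does not state which SD it used.
    """
    _, group_mean, _ = rows[group]
    _, ref_mean, ref_sd = rows[reference]
    return (group_mean - ref_mean) / ref_sd

for label, table in (("All instructors", ALL_INSTRUCTORS),
                     ("Min. 3 sections", MIN_3_SECTIONS)):
    for track, rows in table.items():
        print(f"{label:16s} {track:8s} {gap_in_sd_units(rows):+.2f}")
```

Under that assumption the four gaps come out to -0.26 and -0.57 for all instructors (TTT and non-TTT) and -0.36 and -0.31 for instructors with at least three sections.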

Technical Note:
Because the residual values were calculated on the individual section mean ratings, before the reduction to one mean score per instructor, the overall mean across instructors, collapsed across sections, will not necessarily be 0; in fact, in this dataset it is -.04 for instructor ratings, -.03 for course ratings.
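A small Python sketch shows why this happens. The residuals for the two instructors are hypothetical, chosen so that the section-level mean is zero; the last calculation uses the "All" rows of the first table and should be read as approximate, since the table values are rounded.

```python
from statistics import mean

# Hypothetical section-level residuals that average to zero overall.
section_residuals = {
    "instructor_1": [0.2, 0.2, 0.2, 0.2],  # four sections, slightly above average
    "instructor_2": [-0.8],                # one section, far below average
}

# Mean over sections: every section counts once.
all_sections = [r for rs in section_residuals.values() for r in rs]
section_level_mean = mean(all_sections)

# Mean over instructors: every instructor counts once, so the
# single-section instructor carries as much weight as the other.
instructor_means = [mean(rs) for rs in section_residuals.values()]
instructor_level_mean = mean(instructor_means)

# N-weighted combination of the "All" rows (TTT and non-TTT) in the
# first table, as a rough check on the -.04 reported for instructor ratings.
pooled_instructor_mean = (1157 * 0.00 + 1391 * -0.07) / (1157 + 1391)

print(round(section_level_mean, 2),
      round(instructor_level_mean, 2),
      round(pooled_instructor_mean, 2))
```

In the toy data the section-level mean is zero while the instructor-level mean is about -0.3, because the instructor with a single low-rated section counts as much as the instructor with four sections.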

Other Studies in the Literature

We have not done a systematic search of the higher education literature for other studies in this area. However, a recent study by Centra and Gaubatz (2000) that specifically looked at gender effects (student, instructor, and the interaction between them) reported that results of past studies were inconclusive, with some studies finding no or exceedingly small effects, and a few finding that male students may rate female instructors lower than male instructors.

Centra and Gaubatz's own study of gender bias used data from 741 classes from a variety of institutions, all using a common evaluation form developed by the Educational Testing Service. In their analyses of students in the same class rating either a female or male instructor, they found that female instructors received higher ratings from female than from male students on 6 of 8 scales, including a global rating. The differences were statistically significant but very small (about a quarter of a standard deviation), and thus of little practical significance. Male instructors received the same ratings from male and female students.

In comparisons across classes, female students rated female instructors higher on some scales, male students rated male instructors higher on some others, but global ratings did not differ by instructor or student gender. And the differences were again very small, on the order of a quarter of a standard deviation, and thus of no practical effect.


Centra, J.A., & Gaubatz, N.B. (2000). Is there gender bias in student evaluations of teaching? Journal of Higher Education, 71 (1), 17-33.

Last revision 05/18/16

