Berkeley's 1973 Graduate Admissions Dataset


The "Berkeley Dataset" contains all 12,763 applicants to UC-Berkeley's graduate programs in Fall 1973. The dataset was published by UC-Berkeley researchers as part of an analysis of possible gender bias in admissions, and it has since become a classic example of Simpson's Paradox.

  • Dataset Format: Well-formatted CSV with column headers as the first row
  • Dataset Size: 12,763 rows × 4 columns
  • CSV File Location: https://waf.cs.illinois.edu/discovery/berkeley.csv
  • Dataset Variables:
    • Year : number ➜ The application year (always 1973 in this dataset)
    • Major : string ➜ An anonymized major code (either A, B, C, D, E, F, or Other). The specific majors are unknown, except that A-F are the six majors with the most applicants in Fall 1973
    • Gender : string ➜ Applicant self-reported gender (either M or F)
    • Admission : string ➜ Admission decision (either Rejected or Accepted)
  • Research Paper: Sex Bias in Graduate Admissions: Data from Berkeley by P. J. Bickel, E. A. Hammel, and J. W. O'Connell (1975)
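
The variable list above can double as a quick sanity check after loading. The following is a minimal sketch, assuming the column names and allowed values are exactly as documented; `EXPECTED_VALUES` and `find_schema_problems` are hypothetical names introduced here for illustration and are not part of the dataset or of pandas.

```python
import pandas as pd

# Allowed values per column, copied from the variable list above.
EXPECTED_VALUES = {
    "Year": {1973},
    "Major": {"A", "B", "C", "D", "E", "F", "Other"},
    "Gender": {"M", "F"},
    "Admission": {"Rejected", "Accepted"},
}


def find_schema_problems(df):
    """Return a list of problem descriptions; an empty list means no mismatch was found."""
    problems = []
    for column, allowed in EXPECTED_VALUES.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
            continue
        # Compare the distinct non-missing values against the documented set.
        unexpected = set(df[column].dropna().unique()) - allowed
        if unexpected:
            problems.append(
                f"unexpected values in {column}: {sorted(unexpected, key=str)}"
            )
    return problems


# Small hand-made frames in the documented format (illustrative only).
good = pd.DataFrame({
    "Year": [1973, 1973],
    "Major": ["C", "B"],
    "Gender": ["F", "M"],
    "Admission": ["Rejected", "Accepted"],
})
bad = good.assign(Major=["C", "Z"])  # "Z" is deliberately not a documented major code

print(find_schema_problems(good))
print(find_schema_problems(bad))
```

Passing the full DataFrame loaded from the CSV to the same function would confirm whether the file matches the description on this page.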

Using the Berkeley Dataset in Python

The dataset can be loaded using the pandas library in Python:

import pandas as pd
df = pd.read_csv("https://waf.cs.illinois.edu/discovery/berkeley.csv")
df
       Year  Major  Gender  Admission
0      1973  C      F       Rejected
1      1973  B      M       Accepted
2      1973  Other  F       Accepted
3      1973  Other  M       Accepted
4      1973  Other  M       Rejected
...    ...   ...    ...     ...
12758  1973  Other  M       Accepted
12759  1973  D      M       Accepted
12760  1973  Other  F       Rejected
12761  1973  Other  M       Rejected
12762  1973  Other  M       Accepted

The full Berkeley Dataset stored in a DataFrame (12,763 rows).
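
Since the dataset is usually used to illustrate Simpson's Paradox, a natural next step is to compare the overall acceptance rate by gender with the rate within each major. The sketch below is one possible way to do that with `groupby`; `admission_rates` is a hypothetical helper name, and the small `sample` frame holds only the ten preview rows shown above so that the code runs without a network connection. Ten rows are far too few to show any real pattern, so the printed numbers say nothing about the full dataset.

```python
import pandas as pd


def admission_rates(df, by):
    """Share of rows with Admission == "Accepted" within each group in `by`."""
    flagged = df.assign(Accepted=df["Admission"].eq("Accepted"))
    # The mean of a boolean column is the fraction of True values.
    return flagged.groupby(by)["Accepted"].mean()


# The ten preview rows from this page, used as a tiny offline stand-in.
# For real results, load the full CSV instead:
#   df = pd.read_csv("https://waf.cs.illinois.edu/discovery/berkeley.csv")
sample = pd.DataFrame({
    "Year": [1973] * 10,
    "Major": ["C", "B", "Other", "Other", "Other",
              "Other", "D", "Other", "Other", "Other"],
    "Gender": ["F", "M", "F", "M", "M", "M", "M", "F", "M", "M"],
    "Admission": ["Rejected", "Accepted", "Accepted", "Accepted", "Rejected",
                  "Accepted", "Accepted", "Rejected", "Rejected", "Accepted"],
})

# Aggregate view: one acceptance rate per gender.
by_gender = admission_rates(sample, ["Gender"])

# Disaggregated view: one acceptance rate per (major, gender) pair.
by_major_gender = admission_rates(sample, ["Major", "Gender"])

print(by_gender)
print(by_major_gender)
```

Running the same two calls on the full DataFrame gives the aggregate and per-major views side by side, which is the comparison Simpson's Paradox is about.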

Pages Using the Berkeley Dataset

  1. Learn Page: Simpson's Paradox