An anonymous source sent me this file (csv) of data related to Williams admissions.

[UPDATE: Data removed at the request of the College.]

> library(readr)
> x <- read_csv(file = "http://ephblog.com/wp-content/uploads/2017/09/admissions.csv")
> x
# A tibble: 2,110 x 10
   class enrolled state       country      ethnicity   sex   act reading  math writing
 1  2017        0    AZ United States Asian American     M    NA     770   790     770
 2  2019        0    AZ United States Asian American     F    35     730   770     760
 3  2019        0    AZ United States Asian American     M    NA     800   720     800
 4  2019        0    BC        Canada Asian American     F    NA     800   750     750
 5  2013        1    CA United States Asian American     F    NA     790   800     800
 6  2013        1    CA United States Asian American     M    NA     760   780     790
 7  2013        1    CA United States Asian American     M    NA     790   800     710
 8  2013        1    CA United States Asian American     F    NA     650   590     670
 9  2014        1    CA United States Asian American     F    NA     790   780     720
10  2014        1    CA United States Asian American     F    35     750   800     700
# ... with 2,100 more rows
> 

Comments:

1) Does this look real to you? It does to me, although it is obviously just a sample. Opinions welcome.

2) Should I spend a week exploring this data?

3) The sample is a strange subset of what the “complete” data must look like. For example:

> table(x$class, x$enrolled)
      
         0   1
  2011  86  99
  2013  96 119
  2014 123 105
  2015 124 116
  2016  77 125
  2017 232 159
  2019 172 143
  2020 164 170
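The row totals of this table, and the class years absent from it, can be checked with a short sketch. It assumes nothing beyond the counts printed above, which are retyped by hand because the CSV is no longer available. The 1,250 denominator is only a rough figure for the number of applicants admitted each year, and the names `counts`, `share_of_admits` and `missing_classes` are made up for illustration.

```r
# Counts retyped from the table(x$class, x$enrolled) output above.
counts <- data.frame(
  class        = c(2011, 2013, 2014, 2015, 2016, 2017, 2019, 2020),
  not_enrolled = c(86, 96, 123, 124, 77, 232, 172, 164),
  enrolled     = c(99, 119, 105, 116, 125, 159, 143, 170)
)

# Rows in the sample for each class year.
counts$total <- counts$not_enrolled + counts$enrolled

# Assumption: roughly 1,250 applicants are admitted per year.
admits_per_year <- 1250
counts$share_of_admits <- round(counts$total / admits_per_year, 2)

# Class years between the first and last that do not appear at all.
missing_classes <- setdiff(2011:2020, counts$class)

print(counts)
print(missing_classes)
```

The totals add up to the 2,110 rows reported by the tibble, which is a useful sanity check that the counts were retyped correctly.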

a) Note that there is no data for the class of 2018 (the class of 2012 is missing from the table as well). Perhaps removing this data is one way that Williams keeps track of whom it gave this data to and, therefore, whom it can go after for leaking it to me.

b) The number of students ranges from 185 for the class of 2011 to 391 for the class of 2017. Since around 1,250 applicants are admitted to Williams each year, we definitely don’t have the complete data.

c) It is interesting to see data for applicants who were admitted (I assume that everyone in this data was admitted) but who chose not to enroll.

d) Would you believe a roughly 230-point difference between Asian-American and African-American SAT scores (math plus reading) among Williams students?

> library(dplyr)
> x %>% filter(enrolled == 1) %>% group_by(ethnicity) %>% 
     summarise(count = n(), act = round(mean(act, na.rm = TRUE)), 
               sat = round(mean(reading + math, na.rm = TRUE))) %>% 
     arrange(desc(sat))
# A tibble: 7 x 4
        ethnicity count   act   sat
1  Asian American   186    34  1506
2    Unidentified    18    34  1488
3           White   569    33  1480
4          Non-US    24    31  1374
5 Hispanic/Latino    99    30  1341
6 Native American     7    26  1302
7           Black   133    29  1274

That is what the data suggest . . .
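To make the arithmetic explicit, here is a small sketch that works only from the summary printed above (retyped by hand, since the underlying file is gone). It computes each group's distance from the highest group mean and flags groups with few students. The cutoff of 30 and the column names `gap_from_top` and `small_n` are arbitrary choices for illustration, not anything in the original data.

```r
# Enrolled-student summary retyped from the output above.
s <- data.frame(
  ethnicity = c("Asian American", "Unidentified", "White", "Non-US",
                "Hispanic/Latino", "Native American", "Black"),
  count = c(186, 18, 569, 24, 99, 7, 133),
  sat   = c(1506, 1488, 1480, 1374, 1341, 1302, 1274)
)

# Difference between the highest group mean and each group's mean.
s$gap_from_top <- max(s$sat) - s$sat

# Flag groups whose mean rests on few students (cutoff of 30 is arbitrary).
s$small_n <- s$count < 30

print(s)
```

Two caveats: `count` comes from `n()`, so it includes enrolled students with no SAT on file, and `mean(reading + math, na.rm = TRUE)` drops any student missing either score. The means may therefore rest on fewer students than the counts suggest.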

Can’t resist adding an image:

[Image: density plot of math + reading SAT scores by ethnicity]

Code for generating this below the break.

library(dplyr)
library(ggplot2)

# Density of math + reading SAT for enrolled students in the two groups
x %>%
  mutate(sat = math + reading) %>%
  filter(enrolled == 1, ethnicity %in% c("Asian American", "Black")) %>%
  ggplot(aes(sat, color = ethnicity)) +
  geom_density(na.rm = TRUE) +
  labs(title = "Distribution of SAT Scores by Race",
       x = "Math + Reading SAT", y = "Density")
