Unless you have arbitrary amounts of time to spend on obscure aspects of the debate over cluster housing, you should stop reading right now!

But that’s a cool graphic, eh? Short version: In the current setup, the worst JAs are ranked around 50. Under cluster selection, they would normally be ranked in the 60s, but, in unlucky years, significantly lower. Full details below.

As a follow-up to our discussion of the JA Advisory Board’s worry that making JA selection cluster-based would significantly hurt the quality of JAs selected, I thought that all the geeks among EphBlog’s readers would like to see the nitty-gritty of my analysis. [Liar! You just want to show off some cool R graphics. -- ed. True, but I did get one real request.]

Recall that, in the current setup, the JA Selection Committee can pick the best 50 candidates out of a pool of 150. If JA selection were cluster-based (note that this is not part of the current plan, but it is sure to follow after the initial failure of clusters to develop any meaningful identity), then the committee would need to pick 10 candidates, 5 men and 5 women, from each of the five clusters. Problems would arise when, in some years, the pool of candidates from a given cluster was weak. The Selection Committee would have no choice but to pick among those weak candidates. The result would be lower quality JAs on average.
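To make the mechanism concrete, here is a toy sketch in R. It is not part of the analysis: the pool is shrunk to 12 candidates, gender is ignored, and the cluster allocation is made up to show an unlucky year for cluster B.

```r
## Toy pool: 12 candidates ranked 1 (best) to 12 (worst), 2 clusters, 4 JAs.
## The allocation is invented for illustration: cluster B gets a weak pool.
ranks   <- 1:12
cluster <- c(rep("A", 9), rep("B", 3))

## Pooled selection: simply take the 4 best candidates.
pooled <- head(ranks, 4)

## Cluster-based selection: take the 2 best candidates from each cluster.
by.cluster <- unlist(lapply(split(ranks, cluster), head, n = 2),
                     use.names = FALSE)

pooled       # 1 2 3 4
by.cluster   # 1 2 10 11
```

Candidates 3 and 4 lose out to candidates 10 and 11 only because of where they live.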

But is this just a theoretical problem? Well, using reasonable assumptions, here is how big the problem would be:

   median          75th           95th             max
Min.   :24.5   Min.   :36.2   Min.   : 48.5   Min.   : 53.0
1st Qu.:25.5   1st Qu.:39.5   1st Qu.: 60.5   1st Qu.: 74.0
Median :25.5   Median :40.8   Median : 65.8   Median : 84.0
Mean   :25.9   Mean   :41.7   Mean   : 67.5   Mean   : 86.2
3rd Qu.:26.0   3rd Qu.:43.5   3rd Qu.: 72.5   3rd Qu.: 95.0
Max.   :33.5   Max.   :64.8   Max.   :125.5   Max.   :150.0

What does this mean? Assume that we can rank the JA applicants from 1 (best) to 150 (worst). Each column shows the distribution, over 10,000 simulations of the cluster-based selection process, of one summary statistic of the ranks of the selected JAs. For example, consider the “max” column. In 10,000 tries, the best value for max (that is, the rank of the worst JA selected) was 53. In that simulation, all the JAs selected, even though 10 had to be selected from each cluster, were ranked 53 or better. In another simulation, this value was 150. In other words, there was a cluster with no more than five applicants of one gender, one of whom was the worst potential JA in the pool. He was selected.
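How rare is the worst case? A back-of-the-envelope sketch, assuming the setup of the simulation code further down: candidates are labelled F, M, F, M, ... so candidate 150 is male, and each candidate is assigned to one of five clusters independently and with equal probability. Under those assumptions he is selected only when at most 4 of the other 74 men land in his cluster, which is a binomial tail probability. The names p.worst and expected.runs are mine.

```r
## Chance, per simulation, that the worst candidate (rank 150) is selected.
## Assumes independent, uniform assignment of the other 74 men to 5 clusters.
p.worst <- pbinom(4, size = 74, prob = 1/5)

## Expected number of such simulations among 10,000 runs.
expected.runs <- p.worst * 10000

p.worst
expected.runs
```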

These are extreme and unlikely cases. The median value for “max” is 84, though. This may fairly be taken as evidence for the position of the JA Advisory Board. Under the current system, the worst JA selected has a rank of 50, by assumption. With a cluster plan, this would increase to 84, typically. How much you care about this depends on how much you think 1 JA out of 50 matters and how different in quality you think JA 84 is from JA 50.

Of course, focussing on the “max” value is unfair since it makes the difference between cluster and non-cluster selection as large as possible. Perhaps a more useful point on the distribution would be the 95th percentile, between the 47th and 48th ranked JA. I display the histogram of simulation runs for this case above. The median case is for a rank of 66 or so. That is, under cluster sampling, we will typically have at least 3 JAs who are ranked lower than 66 out of the 150 candidates. One quarter of the time, the worst 3 JAs will be ranked lower than 72.
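For comparison, here are the same four statistics under the current system, where the selected JAs are simply ranks 1 through 50 by assumption. This is a quick sketch using R's default quantile definition (the one the simulation code uses as well); the name baseline is mine.

```r
## Current system: the committee takes the 50 best of 150, so ranks 1..50.
current <- 1:50

baseline <- c(median = median(current),
              "75th" = quantile(current, 0.75, names = FALSE),
              "95th" = quantile(current, 0.95, names = FALSE),
              max    = max(current))

baseline   # median 25.50, 75th 37.75, 95th 47.55, max 50.00
```

The 95th percentile of 47.55 is why that statistic sits between the 47th and 48th ranked JA, leaving three selected JAs beyond it.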

Note that these results are (slightly) different from the last ones that I presented. Code follows. It is quite sloppy: I did not take the time to think hard about how to handle cases where the numbers don’t divide evenly. I don’t believe that the results would be affected, however.

As a side note, I should highlight that the public presentation of all the steps necessary to recreate a given research result is the hallmark of serious scholarly work. This particular project isn’t, obviously, all that serious, but I have tried to be scholarly.

## Simple simulation of how bad the JA’s chosen under cluster housing will be.

## Base case assumes:
## 75 male and 75 female candidates.
## 5 clusters: A, B, C, D, E
## Random distribution of candidates among clusters.

sim.ja <- function(candidates = 150, clusters = 5, num.ja = 50, reps = 10){

  ## Key number is how many per each gender from each cluster we need.

  num.per.gender.per.cluster <- (num.ja / clusters) / 2

  ## Easiest to assume an integer number of winners per cluster for
  ## each gender, an even number of candidates and an even number of
  ## male and female JA's to be chosen per cluster.

  stopifnot(num.per.gender.per.cluster %% 1 == 0)
  stopifnot(candidates %% 2 == 0)
  stopifnot((candidates / clusters) %% 2 == 0)

  ## These are the candidates' true ranks, which the JA selection
  ## committee can observe.

  ranks <- 1:candidates
  gender <- rep(c("F", "M"), candidates / 2)

  ## What clusters do these candidates come from for each simulation?

  allocations <- matrix(sample(LETTERS[1:clusters], candidates*reps, replace = TRUE),
                        nrow = candidates, ncol = reps,
                        dimnames = list(NULL, paste("sim", 1:reps, sep = "."))
                        )

  ## Put together our data frame. Note that the first two columns are
  ## constant. Each of the remaining "reps" columns represents a
  ## simulation, where, for example, the highest ranked candidate
  ## lives in cluster A or cluster D or whatever.

  x <- data.frame(ranks, gender, allocations)

  ## Now, assume that the JA selection committee selects the highest
  ## ranking 25/clusters men and 25/clusters women from each
  ## cluster. Again, by assumption, the selection committee observes
  ## the true ranks and tries to get the highest ranked candidates,
  ## subject to the constraint that the same number (of males and
  ## females?) comes from each cluster.

  ## Note that, without the assumption that the same number of males
  ## and females come from each cluster, we would have to design the
  ## code differently, I think. I suspect that doing so would make the
  ## cluster case look better by a non-trivial amount.

  ## Which candidates are chosen? This is tricky since you need the
  ## top 5 men and women from each cluster. First pass, I'll do this
  ## via a loop, but there must be a cooler way.

  z <- matrix(rep(NA, reps*5), nrow = reps, ncol = 5,
              dimnames = list(NULL, c("mean", "median", "75th", "95th", "max")))

  counter <- 1

  for(i in 1:reps){

    ## At the start of each simulation, we do not know the ranks of
    ## any of the JA's to be selected.

    sim.ranks <- NULL

    for(j in LETTERS[1:clusters]){

      ## Within each simulation and for each cluster, we want the
      ## ranks of the correct number of the best candidates for each
      ## gender. A cluster with too few candidates of a gender yields
      ## NA's here, which the summaries below drop.

      F.ranks <- x$ranks[x[[paste("sim", i, sep = ".")]] == j & x$gender == "F"][1:num.per.gender.per.cluster]
      M.ranks <- x$ranks[x[[paste("sim", i, sep = ".")]] == j & x$gender == "M"][1:num.per.gender.per.cluster]

      stopifnot(all(! F.ranks %in% M.ranks))

      cluster.ranks <- c(F.ranks, M.ranks)

      stopifnot(all(! sim.ranks %in% cluster.ranks))

      sim.ranks <- c(sim.ranks, cluster.ranks)
    }

    z[counter, "mean"]   <- mean(sim.ranks, na.rm = TRUE)
    z[counter, "median"] <- median(sim.ranks, na.rm = TRUE)
    z[counter, "75th"]   <- quantile(sim.ranks, p = 0.75, na.rm = TRUE)
    z[counter, "95th"]   <- quantile(sim.ranks, p = 0.95, na.rm = TRUE)
    z[counter, "max"]    <- max(sim.ranks, na.rm = TRUE)

    counter <- counter + 1
  }

  return(as.data.frame(z))
}

## Once that above function is read in, the R session goes like:

> library(MASS)
> set.seed(1)
> x <- sim.ja(reps = 10000) # May take some time depending on your computer
> truehist(x[["95th"]], xlab = "95th percentile of JA rank in a given simulation", ylab = "Frequency", main = "Lower Tail JAs Selected Via Clusters")
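
For anyone who wants the "cooler way" that the comments wish for, here is one possibility for a single simulated year, using split() instead of the inner loop. It is a sketch rather than a drop-in replacement: select.once is a hypothetical helper of mine, the seed is arbitrary, and the draw will not reproduce the numbers above.

```r
## Hypothetical helper (not in the code above): one simulated year.
select.once <- function(ranks, gender, cluster, k = 5) {
  ## One group per cluster-by-gender cell. Ranks are already sorted, so
  ## the first k entries of each group are that cell's best candidates.
  groups <- split(ranks, list(cluster, gender))
  ## Indexing with seq_len(k) pads short cells with NA, as the loop does.
  unlist(lapply(groups, function(r) r[seq_len(k)]), use.names = FALSE)
}

set.seed(1)  # arbitrary seed, for reproducibility only
ranks   <- 1:150
gender  <- rep(c("F", "M"), 75)
## Fixing the factor levels keeps all five clusters even if one is empty.
cluster <- factor(sample(LETTERS[1:5], 150, replace = TRUE),
                  levels = LETTERS[1:5])

picked <- select.once(ranks, gender, cluster)
max(picked, na.rm = TRUE)  # rank of the worst JA chosen in this draw
```

Wrapping this in replicate() or a loop over simulations would recover the full experiment, though I have not checked whether it is any faster.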
