Only a handful of our readers could possibly care about this, but I really want to figure out the correct answer to the question: How many Williams undergraduates preferred John Kerry for President?

Geekery follows below:

Recall that our previous thread on the topic referenced this Record poll. The key statistics are:

Sample size: 375
Kerry supporters: 294 (from 78.4% support for Kerry)
Williams students: 2,000 (more or less)

Given the commentary from David Nickerson and Todd Gamblin, I am fairly certain that the correct standard error is 2.1%. The textbook formula in this case is:

sqrt(p * (1-p) / N) or
sqrt(0.78 * 0.22 / 375) = 0.021
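
For anyone who wants to check the arithmetic, here is a minimal Python sketch of that formula. The function name is mine, and I plug in the unrounded 0.784 rather than 0.78, which changes the answer only in the fourth decimal place.

```python
import math

def binomial_standard_error(p, n):
    """Textbook standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Record poll: 78.4% support for Kerry among 375 respondents.
se = binomial_standard_error(0.784, 375)
print(f"standard error: {se:.4f}")  # roughly 0.021, i.e. about 2.1 percentage points
```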

Given the large N, it should be OK to use a normal approximation, so a 95% confidence interval for the true proportion of Kerry supporters would be 78.4 +/- 1.96 * 2.1 percentage points.

That is, we can be 95% certain that, if we were to ask every single Williams student, somewhere between 74.3% and 82.5% would vote for Kerry. This would put the total number of Kerry supporters somewhere between 1,486 and 1,650.
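
The interval and the head counts can be reproduced with a short sketch like the one below. The variable names are my own, and 2,000 is the approximate enrollment quoted above. Because the code keeps the standard error unrounded instead of using 2.1%, the end points can differ from the figures in the text by a student or so.

```python
import math

p = 0.784          # sample proportion supporting Kerry
n = 375            # sample size
population = 2000  # approximate number of Williams students
z = 1.96           # normal critical value for a 95% interval

se = math.sqrt(p * (1 - p) / n)
low, high = p - z * se, p + z * se

# Convert the proportions into head counts for the whole student body.
low_count = round(low * population)
high_count = round(high * population)

print(f"95% CI: {low:.1%} to {high:.1%}")
print(f"Kerry supporters: {low_count} to {high_count}")
```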

But I think that this is wrong or, rather, I think that it is too wide. Recall that this calculation would be correct even if the total number of Williams students were 20,000 or even 20 million. We are not making use of the fact that 375 is a good-sized portion of the entire student body.

Consider a thought experiment in which there are only 400 students at Williams, 375 of whom the Record surveyed. All the above calculations would still apply. The confidence interval would then be from 297 to 330. But this would make no sense: we already know that 294 of the surveyed Ephs support Kerry and that only 25 students went unsurveyed, so the true total must lie between 294 and 319.

In other words, when your sample size is a significant fraction of the total population that you care about (and 375 is a significant fraction of 2,000) you should not use the binomial distribution in calculating your standard errors. Instead, you should use the hypergeometric.

To smarter Ephs than me: Is that correct?

Assuming that it is, then we should be adjusting the confidence interval to be narrower. I think that the approximate solution is to scale the variance by the proportion not sampled (which, for large populations, just means multiplying by a number close to one). In our case, it means multiplying the variance by 1625/2000, or 0.81. This implies scaling the standard deviation by the square root of this, or 0.9.
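
Here is a small sketch of that scaling factor. The function name and the `exact` flag are my own labels. The approximate version divides by the population size, as described above. The usual textbook finite population correction, which is what the hypergeometric variance gives, divides by the population size minus one. With 2,000 students the two are nearly identical.

```python
import math

def fpc_variance_factor(population, sample, exact=False):
    """Factor by which to multiply the variance when sampling without replacement.

    exact=False: (N - n) / N, the proportion not sampled.
    exact=True:  (N - n) / (N - 1), the hypergeometric correction.
    """
    denominator = population - 1 if exact else population
    return (population - sample) / denominator

approx_factor = fpc_variance_factor(2000, 375)             # 1625 / 2000
exact_factor = fpc_variance_factor(2000, 375, exact=True)  # 1625 / 1999
sd_factor = math.sqrt(approx_factor)

print(f"variance factor (approx): {approx_factor:.4f}")
print(f"variance factor (exact):  {exact_factor:.4f}")
print(f"standard deviation factor: {sd_factor:.3f}")
```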

So, my guess is that the true 95% confidence interval is about 10% narrower than 1,486 to 1,650. I would put it at more like 1,494 to 1,642. This is, obviously, not a big enough shrinkage to matter, but it is nice to try to get the right answer.
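
Putting the pieces together, a rough sketch of the corrected interval might look like this. The function and its parameters are hypothetical names of mine, and it uses the approximate (N - n) / N scaling with an unrounded standard error, so its end points can land a student or so away from the numbers above.

```python
import math

def kerry_count_interval(p, n, population, z=1.96, correct=True):
    """95% interval for the number of supporters in the whole population.

    With correct=True the standard error is shrunk by sqrt((N - n) / N),
    the approximate finite population correction.
    """
    se = math.sqrt(p * (1 - p) / n)
    if correct:
        se *= math.sqrt((population - n) / population)
    low, high = p - z * se, p + z * se
    return round(low * population), round(high * population)

naive = kerry_count_interval(0.784, 375, 2000, correct=False)
corrected = kerry_count_interval(0.784, 375, 2000)

print("without correction:", naive)
print("with correction:   ", corrected)
```

The corrected interval is about a tenth narrower than the uncorrected one, which matches the back-of-the-envelope estimate.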

Exercise for the reader: Analyze this data as a good Bayesian should.
