Sun 5 Dec 2004

Only a handful of our readers could possibly care about this, but I really want to figure out the correct answer to the question: How many Williams undergraduates preferred John Kerry for President?

Geekery follows below:

Recall that our previous thread on the topic referenced this *Record* poll. The key statistics are:

Sample size: 375

Kerry supporters: 294 (from 78.4% support for Kerry)

Williams students: 2,000 (more or less)

Given the commentary from David Nickerson and Todd Gamblin, I am fairly certain that the correct standard error is 2.1%. The textbook formula in this case is:

sqrt(p * (1-p) / N) or

sqrt(0.78 * 0.22 / 375) = 0.021

Given the large N, it must be OK to use a normal approximation and, so, a 95% confidence interval for the true proportion of Kerry supporters would be 78.4% +/- 1.96 * 2.1%.

That is, we can be 95% certain that, if we were to ask every single Williams student, somewhere between 74.3% and 82.5% would vote for Kerry. This would put the total number of Kerry supporters somewhere between 1,486 and 1,650.
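For anyone who wants to check the arithmetic, here is a minimal sketch of the textbook calculation in Python. The function name `normal_interval` is my own invention, and 2,000 is only the rough enrollment figure from above. Because the code does not round the standard error to 2.1% first, its endpoints land within a student or two of the figures quoted.

```python
import math

def normal_interval(successes, n, population, z=1.96):
    """Textbook binomial standard error and normal-approximation interval.

    Returns (standard error, low count, high count), with the counts scaled
    up to the whole population. Ignores the finite population size.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se
    return se, (p_hat - margin) * population, (p_hat + margin) * population

# Figures from the Record poll; 2,000 students is an approximation.
se, low, high = normal_interval(successes=294, n=375, population=2000)
print(f"standard error: {se:.4f}")
print(f"Kerry supporters: roughly {low:.0f} to {high:.0f}")
```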

But I think that this is wrong or, rather, I think that it is too wide. Recall that this calculation would be correct even if the total number of Williams students were 20,000 or even 20 million. We are not making use of the fact that 375 is a good-sized portion of the entire student body.

Consider a thought experiment in which there are only 400 students at Williams, 375 of whom the *Record* surveyed. All the above calculations would still apply. The confidence interval would then be from 297 to 330. But this would make no sense, since we already know that *at minimum* 294 Ephs support Kerry (the ones who told the *Record* so) and *at most* 319 do (those 294 plus all 25 unsurveyed students). The upper end of the interval is impossible.

In other words, when your sample size is a significant fraction of the total population that you care about (and 375 is a significant fraction of 2,000) you should not use the binomial distribution in calculating your standard errors. Instead, you should use the hypergeometric.

To smarter Ephs than me: Is that correct?
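One rough way to check is by simulation. The sketch below assumes, purely for illustration, that exactly 1,568 of 2,000 students (78.4%) support Kerry, draws many samples of 375 without replacement, and compares the spread of the sample proportions with the binomial standard error. The function name, trial count, and seed are arbitrary choices of mine.

```python
import math
import random
import statistics

def simulated_se(population=2000, supporters=1568, n=375, trials=4000, seed=0):
    """Standard deviation of the sample proportion when sampling
    without replacement from a fixed, finite student body."""
    rng = random.Random(seed)
    students = [1] * supporters + [0] * (population - supporters)
    # random.sample draws without replacement, like a real poll.
    props = [sum(rng.sample(students, n)) / n for _ in range(trials)]
    return statistics.pstdev(props)

p = 1568 / 2000
binomial_se = math.sqrt(p * (1 - p) / 375)   # ignores population size
empirical_se = simulated_se()

print(f"binomial standard error:  {binomial_se:.4f}")
print(f"simulated standard error: {empirical_se:.4f}")
```

If the simulated figure comes out noticeably below the binomial one, that is evidence that the binomial formula overstates the uncertainty when the sample is a sizable share of the population.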

Assuming that it is, then we should be adjusting the confidence interval to be more narrow. I *think* that the approximate solution is to scale the variance by the proportion not sampled (which, for large populations, just means multiplying by a number close to one). In our case, it means multiplying the variance by 1625/2000 or 0.81. This implies scaling the standard error by the square root of this, or 0.9.

So, my guess is that the true 95% confidence interval is 10% more narrow than 1,486 to 1,650. I would put it at more like 1,494 to 1,642. This is, obviously, not a big enough shrinkage to matter but it is nice to try and get the right answer.
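Here is a small sketch of that adjustment, scaling the variance by the fraction of students not sampled as described above. The function name `corrected_interval` is made up for this post. Textbooks usually write the correction as (N - n) / (N - 1) rather than (N - n) / N, which makes almost no difference at this size. Since nothing is rounded along the way, the endpoints differ by about a student from my back-of-the-envelope numbers.

```python
import math

def corrected_interval(successes, n, population, z=1.96):
    """Normal-approximation interval with the variance scaled by the
    fraction of the population left unsampled, (N - n) / N."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    shrink = math.sqrt((population - n) / population)  # about 0.9 here
    margin = z * se * shrink
    return shrink, (p_hat - margin) * population, (p_hat + margin) * population

shrink, low, high = corrected_interval(successes=294, n=375, population=2000)
print(f"standard error scaled by: {shrink:.3f}")
print(f"Kerry supporters: roughly {low:.0f} to {high:.0f}")
```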

Exercise for the reader: Analyze this data as a good Bayesian should.
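One possible starting point, offered as a sketch rather than the definitive answer: put a uniform prior on K, the total number of Kerry supporters among 2,000 students, use the hypergeometric likelihood of seeing 294 supporters in a sample of 375, and read off a central 95% credible interval. The uniform prior and the function name `posterior_interval` are my assumptions; a good Bayesian might well choose a more informative prior. `math.comb` requires Python 3.8 or later.

```python
from math import comb

def posterior_interval(successes=294, n=375, population=2000, mass=0.95):
    """Central credible interval for K, the population count of supporters,
    under a uniform prior on K and a hypergeometric likelihood."""
    failures = n - successes
    # K must cover the observed supporters and leave room for the
    # observed non-supporters.
    ks = range(successes, population - failures + 1)
    # Unnormalized posterior: C(K, successes) * C(N - K, failures).
    weights = [comb(k, successes) * comb(population - k, failures) for k in ks]
    total = sum(weights)
    tail = (1 - mass) / 2
    cum = 0
    low = high = None
    for k, w in zip(ks, weights):
        cum += w
        frac = cum / total  # exact big integers, divided only at the end
        if low is None and frac >= tail:
            low = k
        if frac >= 1 - tail:
            high = k
            break
    return low, high

low, high = posterior_interval()
print(f"95% credible interval for Kerry supporters: {low} to {high}")
```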


## One Response to “Eph Poll Geekery”


Diana says:

One of the crucial assumptions when you sample from a population and fit a normal model is that you are questioning less than 10% of the population. Below 10%, the fact that you are sampling without replacement doesn’t have much effect, but in the case of the ~20% you cite, it begins to have an effect.

When you survey more than about 10% of the population, you have to change your confidence interval calculation to something that approaches the traditional “sampling without replacement” situation of a sock drawer with a finite number of socks. However, the method for handling a sample that is greater than 10% of the population but less than the whole is too hard, or maybe too time-consuming or too little-used, for us to learn in Stat 201.

December 5th, 2004 at 8:05 pm