I continue to be struck/inspired by Professor Jackall’s description of Williams as a “cross-generational community of learning.” His is a marvelous phrase, although we may need to shorten it into a snappy acronym like CGCL. As an example of the sort of stuff that should occur in a true CGCL, we have Professor Bernhard Klingenberg of the Math Department providing us with some more commentary about our Eph Poll Geekery.

[Y]our reasoning in the blog is correct. Almost all surveys are truly without replacement, as you hardly ever ask the same person twice. This violates the “identical” assumption for Bernoulli trials, which says that the probability of success stays the same from trial (i.e., student) to trial (i.e., student). (Note that only Bernoulli trials make up a binomial response.) As we sample more and more students, the probability that we are going to meet a Republican gets smaller and smaller (because there are so few) and hence doesn’t stay constant.

More details are below, but I wanted to highlight our thanks to Professor Klingenberg for taking the time to explain all of this to the sometimes befuddled alums who read EphBlog. Professors, especially untenured professors, are busy people and every minute that Klingenberg spent on this was a minute that he wasn’t spending on his research or on teaching current students.

Because, like Jackall, I believe that Williams is, or should aspire to be, a cross-generational community of learning, I think that this is a good and proper use of Klingenberg’s time. Indeed, it is impossible to have a CGCL unless people (faculty, alums and students) all take the time to teach and to listen. Kudos to Professor Klingenberg for showing us how it is done (and to Professor Sam Crane for having already done so on many other occasions).

On the other hand, finding a Democrat gets more likely the more students we sample. The correct model which describes the number of Democrats (denoted by x) in a sample of n=375 students from a population consisting of M Democrats and N-M Republicans (an idealized model that ignores other options such as independent or no opinion) is the hypergeometric distribution. M is unknown to us, but we know that there are a total of N=2000 students. The expected value of x is

E[x]=n*M/N

and the variance of x is

Var[x] = [(N-n)/(N-1)] * n * (M/N) * ((N-M)/N).

Dividing x by n (which divides the variance by n^2), we get the mean and the variance for the sample proportion p_hat as

E[p_hat]=M/N

Var[p_hat] = [(N-n)/(N-1)] * (M/N) * ((N-M)/N) * (1/n).
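
[A rough numerical check of the mean and variance formulas for x, sketched in Python. The value M = 1568 is an assumption made purely for illustration (78.4% of N = 2000, borrowing the Kerry share mentioned elsewhere in this discussion), and the function name hypergeom_moments is invented here. The sketch sums over every possible count of Democrats in the sample and compares the exact mean and variance with the closed-form expressions.]

```python
from fractions import Fraction
from math import comb


def hypergeom_moments(N, M, n):
    """Exact mean and variance of x, the number of Democrats in a sample
    of n students drawn without replacement from N students, M of whom
    are Democrats. Computed by brute-force summation, not by formula."""
    total = comb(N, n)  # number of equally likely samples of size n
    lo, hi = max(0, n - (N - M)), min(n, M)
    # weights[k] = number of samples containing exactly k Democrats
    weights = {k: comb(M, k) * comb(N - M, n - k) for k in range(lo, hi + 1)}
    mean = Fraction(sum(k * w for k, w in weights.items()), total)
    second_moment = Fraction(sum(k * k * w for k, w in weights.items()), total)
    return mean, second_moment - mean ** 2


N, n = 2000, 375
M = 1568  # hypothetical number of Democrats (assumption, not a known value)

mean, var = hypergeom_moments(N, M, n)

# Closed-form expressions from the text
formula_mean = Fraction(n * M, N)
formula_var = Fraction(N - n, N - 1) * n * Fraction(M, N) * Fraction(N - M, N)

print("mean:    ", float(mean), "vs formula", float(formula_mean))
print("variance:", float(var), "vs formula", float(formula_var))
```

[Because the sketch uses exact rational arithmetic, the brute-force values and the formulas should match exactly for any valid choice of N, M and n, not just the assumed ones.]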

To form a confidence interval for the proportion of students who vote Democrat, we still appeal to the central limit theorem (we assume that individual responses are independent, e.g., no student is influenced by the opinion of another student!) and estimate the unknown M from the sample by multiplying the sample proportion by the total population size, i.e.,

M_hat=p_hat*N.

Substituting this into the formulas above, a 95% confidence interval for the true proportion p is given by

p_hat +/- 1.96*sqrt([(N-n)/(N-1)]*p_hat*(1-p_hat)/n).

As you pointed out (almost) correctly, the correction factor when compared to a “regular” confidence interval for a proportion is sqrt((N-n)/(N-1)), i.e., sqrt(1625/1999), which is about 0.9. Notice, as N grows larger, for constant n, this correction factor gets closer to 1, i.e., has less of an impact.
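
[A minimal Python sketch of the corrected interval, added for illustration. The helper names fpc and proportion_ci are invented here, and p_hat = 0.784 is simply the Kerry share used elsewhere in this discussion. The loop at the end tries a few hypothetical population sizes to show the correction factor creeping toward 1 as N grows with n fixed.]

```python
from math import sqrt


def fpc(N, n):
    """Finite population correction factor sqrt((N-n)/(N-1))."""
    return sqrt((N - n) / (N - 1))


def proportion_ci(p_hat, n, N, z=1.96):
    """Approximate confidence interval for a proportion when sampling
    n units without replacement from a population of size N."""
    half_width = z * fpc(N, n) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


print("correction factor:", fpc(2000, 375))
print("95% interval:", proportion_ci(0.784, 375, 2000))

# Same sample size, larger (hypothetical) populations
for N in (2000, 20000, 200000):
    print(N, fpc(N, 375))
```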

Give me a prior for the proportion, and I can give you the Bayesian interval. The Bayesian posterior interval has the nice feature that it can be interpreted as the probability of containing the true parameter, whereas the frequentist interval cannot! 95% confident means that about 95 out of 100 confidence intervals we construct with (imaginary) samples of the same size will contain the true parameter, but not that the probability is 95%, as someone on the thread was implying!
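
[The "95 out of 100 imaginary samples" reading can be illustrated with a small simulation, sketched here in Python. Everything about the setup is an assumption for illustration: a population of N = 2000 with a hypothetical M = 1568 Democrats, 2000 repeated samples of n = 375 drawn without replacement, and a fixed random seed. The sketch counts how often the corrected interval contains the true proportion M/N.]

```python
import random
from math import sqrt


def coverage(N=2000, M=1568, n=375, trials=2000, z=1.96, seed=0):
    """Fraction of simulated samples whose interval contains the true
    proportion M/N. M is a hypothetical value chosen for illustration."""
    rng = random.Random(seed)
    population = [1] * M + [0] * (N - M)  # 1 = Democrat, 0 = Republican
    p_true = M / N
    correction = sqrt((N - n) / (N - 1))
    hits = 0
    for _ in range(trials):
        # rng.sample draws without replacement, as in a real survey
        p_hat = sum(rng.sample(population, n)) / n
        half_width = z * correction * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= p_true <= p_hat + half_width:
            hits += 1
    return hits / trials


print("share of intervals covering the true proportion:", coverage())
```

[The printed share should land near 0.95. Each individual interval either contains the true proportion or it does not; the 95% describes the procedure over many repetitions.]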

Oren Cass ’05 was also kind enough to provide the exact procedure that the Record used in calculating its confidence interval.

I calculated the margin of error for our poll using the typical formula with adjustment for population size:

moe (+/- as %) = 1.96*sqrt((1-p)*p/n)*sqrt((pop-n)/(pop-1))*100

In your most recent post on the subject, you came up with a Margin of Error of around 4.1% [or 3.7% after the 0.9 correction] using this formula because you used p = .784. If we were only reporting one piece of data, the percentage supporting Kerry in the election, it would arguably be correct to use that p although it might still make more sense to use a p = .5 so as to give the most conservative and therefore certainly accurate confidence interval.

However, our poll covered various other questions that had results closer to the 50% mark, making a p = .5 most appropriate. Since we were reporting on the Margin of Error for the survey as a whole, and not only the Kerry/Bush question, it was therefore necessary to use p = .5. We provided sample size data to allow others to calculate the margins of error for individual questions if they were so inclined.

Therefore, if we use p = .5, n = 375, pop = 2000 in our equation we get:

moe = 1.96*sqrt(.25/375)*sqrt(1625/1999)*100 = +/- 4.563%

which we then rounded to 4.6%
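
[For readers who want to recompute the margin of error for an individual question, here is a short Python sketch of the formula with the population-size adjustment. The function name moe and the list of p values are chosen here for illustration; n = 375 and pop = 2000 are the figures given in this post. It also shows why p = .5 is the conservative choice: p*(1-p) is largest there, so no other p gives a wider margin.]

```python
from math import sqrt


def moe(p, n, pop, z=1.96):
    """Margin of error in percentage points, including the
    population-size adjustment sqrt((pop-n)/(pop-1))."""
    return z * sqrt(p * (1 - p) / n) * sqrt((pop - n) / (pop - 1)) * 100


# Illustrative p values; 0.784 is the Kerry share discussed in this post
for p in (0.5, 0.6, 0.7, 0.784, 0.9):
    print(f"p = {p:.3f}  moe = +/- {moe(p, 375, 2000):.2f}%")
```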

Thanks to Cass for providing these details.

I will be posting many further pages of Eph Poll Geekery analysis over the next few days. Be sure to read them all closely!