In the spirit of transparency, here are some notes on my Winter Study.

1) We have an e-mail list for the class. But guests are welcome! Let me know if you want to sign up.

2) We started the class with 6 people. One student dropped for a good reason but two others dropped for the wrong reasons. The remaining three are ready to work hard, I hope!

3) Here is the graphic that we created in the first class.

Pretty cool, eh? See the code below for details and reasoning. Any R users out there willing to try to replicate our work? That would be cool.


## Notes from class. Goal was to create a histogram of the ages of
## Williams faculty members. Assumes that you have grabbed a copy of
## the official faculty listing.

## http://www.williams.edu/Registrar/geninfo/faculty.pdf

## You then need to save a copy of that file as text in your working directory.

## First, we need to read in the data and store it as x, a character
## vector.

x = readLines("faculty.txt")

## By inspection, we notice that the file has all sorts of junk in
## it. The leading and ending rows are useless. There are a lot of
## empty rows. There are several page numbers and weird
## characters. So, by trial and error, we figure out which rows those
## are and then delete them.

x = x[10:(length(x)-4)]

x = x[x != ""]

x = x[nchar(x) > 10 & x != "Second Semester"]

## Needless to say, we could have done these steps more concisely, but
## that would have made the logic less clear.
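
## For example, the last two filters can be folded into one. This is
## only a sketch on a made-up vector (toy is hypothetical, not the
## real file). Since nchar("") is 0, the nchar test already drops the
## empty rows, so only the nchar test and the "Second Semester" test
## are needed.

toy = c("", "12", "Second Semester", "B.A. (1975) Williams College")

toy[nchar(toy) > 10 & toy != "Second Semester"]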

## Now, we want to figure out all the elements that include a
## graduation year. Hard to do, especially since various professor
## titles include 4 digit years. How do we avoid a row like this?

## Frank Morgan, Webster Atwell Class of 1921 Professor of Mathematics

## Need some regular expression magic. Here is a slightly cleaner
## version of what we came up with in class.

degree = grep("\\(\\d", x, value = TRUE)

## This finds every element in x with an opening parenthesis, followed
## by a digit. (Of course, we should check for a four-digit year
## within the (), but this is good enough for now.) Look at degree to
## make sure it looks OK. Because ( is a special character in regular
## expressions, we need to “escape” it with the backslash. And,
## because we are doing all this from within R, we need to escape the
## escape, hence two backslashes. See help(regex) for background.
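
## A stricter check might look like the sketch below, which requires
## exactly four digits between the parentheses. It is shown on
## made-up lines (sample.lines is a hypothetical vector, not the real
## listing), so treat it as an illustration rather than what we ran
## in class.

sample.lines = c("B.S. (1989) M.I.T.; Ph.D. (1994) M.I.T.",
                 "Frank Morgan, Webster Atwell Class of 1921 Professor of Mathematics",
                 "See note (3) below")

## Only the first element has a four-digit number inside parentheses.

grep("\\(\\d{4}\\)", sample.lines, value = TRUE)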

## Now, we have a bunch of elements that look like:

## “B.S. (1989) M.I.T.; Ph.D. (1994) M.I.T. ”

## But all we want is the number 1989, the year of first graduation,
## which will mostly be the undergraduate degree. Fortunately, some
## more regular expression magic does that.

## First, we find the location in each string of where the year starts.

start.point = regexpr("\\d{4}", degree)

## Then we use that information to pull out the four character year.

year = substr(degree, start.point, start.point + 3)
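
## To see what those two steps do, here is a sketch on one made-up
## element (one.degree and one.start are hypothetical names, standing
## in for an element of degree and of start.point). The first run of
## four digits starts at character 7, so substr() pulls out
## characters 7 through 10.

one.degree = "B.S. (1989) M.I.T.; Ph.D. (1994) M.I.T."

one.start = regexpr("\\d{4}", one.degree)

substr(one.degree, one.start, one.start + 3)   ## "1989"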

## But year is character, so we coerce it to numeric, and then
## subtract it from the current year (2010) in order to get years
## since graduation, and then add 22 as an estimate of the average age
## at graduation.

age = 2010 - as.numeric(year) + 22
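
## As a quick check of the arithmetic, using a made-up graduation
## year: someone who graduated in 1989 gets an estimated age of
## 2010 - 1989 + 22 = 43.

2010 - as.numeric("1989") + 22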

## And then a simple histogram.

hist(age, main = "Estimated Ages of Williams Faculty")
————————

In class today, we discovered a bug in this code. The histogram is the same, but further analysis is slightly messed up. Do you see it?
