I've been saying for ages that I need to learn the statistical programming language R, so that I can work with all the bioinformatic data we're generating. So yesterday I looked through the Coursera offerings and found an introductory statistics course that teaches R (Data Analysis and Statistical Inference, taught by Mine Cetinkaya-Rundel of Duke University). It started a few weeks ago, and I spent yesterday watching the Week 1 videos and doing the Week 1 R lab.
The labs are excellent. They use a web-based R learning platform called DataCamp - each lab is a long series of guided exercises: for each one you're given a bit of instruction and asked to use it to display, graph, or calculate something. Integrated multiple-choice questions check your understanding, and DataCamp automatically sends your results back to Coursera.
It's also very good that they're part of a basic statistics course, since I've always been disgracefully ignorant of this. The video lectures are good - short but many, and aimed at the complete beginner. I was initially quite pleased to be learning what 'marginal' means, and the differences between variance and standard deviation and standard error. The course materials are very well designed.
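For my own notes, here's a quick sketch in R (mine, not from the course) of how those three quantities relate, using a made-up vector of values:

    x <- c(4, 8, 15, 16, 23, 42)
    var(x)                   # variance: the average squared deviation from the mean (R divides by n-1)
    sd(x)                    # standard deviation: the square root of the variance, in the same units as x
    sd(x) / sqrt(length(x))  # standard error: how much the mean of a sample this size is expected to vary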
But I started getting frustrated when I tried to think more deeply about quantifying variation. I can sort-of understand why we want to know how variable the members of the population are, but this was never really explained, and I have no idea why we measure it the way we do. To me it seems simplest to measure variation by calculating how different each individual is from the mean (the deviations), summing the absolute values of these, and dividing by the number of individuals. But that's not what's done. Instead we square each of the deviations and average those, to get the 'variance'. But we don't use the variance (at least not yet); instead we take its square root, which we call the 'standard deviation' and use as our measure of variation. Why is this cumbersome measure better than just taking the mean of the absolute deviations? The instructor doesn't explain; the textbook doesn't explain.
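To see what's at stake I wrote out both measures in R myself (my own sketch on a made-up sample, not anything from the course or the textbook):

    x <- c(2, 4, 4, 4, 5, 5, 7, 9)    # a small made-up sample
    dev <- x - mean(x)                # the deviations from the mean
    mean(abs(dev))                    # my 'simple' measure, the mean absolute deviation: 1.5
    sqrt(sum(dev^2) / length(x))      # square, average, square-root: the (population) standard deviation: 2
    sd(x)                             # R's sd() divides by n-1 rather than n, giving about 2.14

For this little sample the two numbers aren't far apart, which only deepens the mystery for me.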
In the first lecture of Week 3 we come to something called 'the Central Limit Theorem'. This is apparently a big deal, but I don't know what the point is. We've moved from considering the properties of a single sample from a population to considering the properties of many independent samples (size n) from the same population - I have no idea why. The Central Limit Theorem tells us that, if we take many samples and calculate the mean of each one, the mean of these means will be the same as the population mean (is this not expected?), that the shape of the distribution of means will be 'nearly normal', and that the standard deviation of the means will be smaller than that of the population, by a factor of 1/√n. So what? What do we gain by repeating our sampling many times? It seems like a lot of work, to what end?
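To convince myself there was at least something concrete being claimed, I tried a little simulation in R (my own sketch, not part of the course), using a deliberately skewed population:

    set.seed(1)
    population <- rexp(100000, rate = 1)                          # a skewed population with mean 1 and sd 1
    n <- 50
    sample_means <- replicate(5000, mean(sample(population, n)))  # the mean of each of 5000 samples
    mean(sample_means)    # close to the population mean of 1
    sd(sample_means)      # close to sd(population) / sqrt(n), roughly 0.14
    hist(sample_means)    # roughly bell-shaped, even though the population isn't

The numbers do come out as advertised, but that still doesn't tell me why anyone would go to the trouble of taking thousands of samples in the first place.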
Then we're supposed to learn a list of conditions under which the Central Limit Theorem applies. But without understanding the point, this felt too much like rote memorization, and why should I bother?