She lost me at the Central Limit Theorem

I've been saying for ages that I need to learn the statistical programming language R, so that I can work with all the bioinformatic data we're generating.  So yesterday I looked through the Coursera offerings and found an introductory statistics course that teaches R (Data Analysis and Statistical Inference, taught by Mine Cetinkaya-Rundel of Duke University).  It started a few weeks ago, and I spent yesterday watching the Week 1 videos and doing the Week 1 R lab.

The labs are excellent.  They use a web-based R learning platform called DataCamp - each lab is a long series of guided exercises: with each exercise you're given a bit of instruction and asked to use it to display something or graph something or calculate something.  Integrated multiple-choice questions check your understanding - DataCamp automatically sends your results back to Coursera.

It's also very good that they're part of a basic statistics course, since I've always been disgracefully ignorant of the subject.  The video lectures are good - short but many, and aimed at the complete beginner.  I was initially quite pleased to be learning what 'marginal' means, and the differences between variance and standard deviation and standard error.  The course materials are very well designed.

But I started getting frustrated when I tried to think more deeply about quantifying variation.  I can sort-of understand why we want to know how variable the members of the population are, but this was never really explained, and I have no idea why we measure it the way we do.  To me it seems simplest to measure variation by calculating how different each individual is from the mean (the deviations), summing the absolute values of these and dividing by the number of individuals.  But that's not what's done.  Instead we square each of the deviations and average the squares, to get the 'variance'.  But we don't use the variance (at least not yet); instead we take its square root, which we call the 'standard deviation' and use as our measure of variation.  Why is this cumbersome measurement better than just taking the mean of the absolute deviations?  The instructor doesn't explain; the textbook doesn't explain.
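To make the comparison concrete, here is a minimal Python sketch (not from the course, which uses R) that computes both measures for a small made-up data set.  The function names and the numbers are assumptions chosen for illustration, and the variance here divides by n for simplicity, whereas R's var() and sd() divide by n - 1.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


# Hypothetical measurements, invented for this example.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:", mean(data))
print("mean absolute deviation:", mean_absolute_deviation(data))
print("variance:", population_variance(data))
print("standard deviation:", standard_deviation(data))
```

Both measures describe the same spread, but squaring weights the large deviations more heavily, so the standard deviation comes out larger than the mean absolute deviation whenever the deviations are not all the same size.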

In the first lecture of Week 3 we come to something called 'The Central Limit Theorem'.  This is apparently a big deal, but I don't know what the point is.  We've moved from considering the properties of a single sample from a population to considering the properties of many independent samples (each of size n) from the same population - I have no idea why.  The Central Limit Theorem tells us that, if we take many samples and calculate the mean of each one, the mean of these means will be the same as the population mean (is this not expected?), and that the shape of the distribution of means will be 'nearly normal', and that the standard deviation of the means will be smaller than that of the population, by a factor of 1/√n.  So what?  What do we gain by repeating our sampling many times?  Seems like a lot of work, to what end?
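The claims in that paragraph can be checked with a short simulation.  What follows is a rough Python sketch under stated assumptions, not anything from the course: the population is taken to be exponential with mean 1 and standard deviation 1 (deliberately skewed, so not normal), and the sample size of 25, the 2000 repetitions and the function name are all arbitrary choices.

```python
import math
import random
import statistics


def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples independent samples of size n from an
    exponential population (mean 1, SD 1) and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means


n = 25
means = simulate_sample_means(n, num_samples=2000)

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)
predicted_sd = 1.0 / math.sqrt(n)  # population SD divided by sqrt(n)

print("mean of sample means:", round(mean_of_means, 3))
print("SD of sample means:  ", round(sd_of_means, 3))
print("predicted SD (1/sqrt(n)):", predicted_sd)
```

The repeated sampling is only done here to check the theorem.  In practice you take one sample, and the theorem tells you how far its mean is likely to be from the population mean (typically within a couple of multiples of the population standard deviation divided by √n) without your ever having to repeat the sampling.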

Then we're supposed to learn a list of conditions under which the Central Limit Theorem applies.  But without understanding the point, this was too much like rote memorization to me, and why should I bother?

4 comments:

  1. The variance is preferred to your measure of variation because it's linked, mathematically, to the definition of a normal distribution; specifically it's one of two parameters (the other being the mean) that completely describe a normal distribution. Because most of statistics is about comparing what happens in normal distributions, this is correspondingly important to statisticians.

    The point of the central limit theorem is not that you repeat things multiple times but that it lets you understand how the sample mean relates to the population mean. When you're getting a p-value from a statistical test, what you're doing is calculating a probability that the population means of your two populations are distinct given your sample data. To derive this you need to know how your sample mean links to the population mean: hence the central limit theorem.

    Note that while you describe this as being obvious, mathematicians are never happy unless they've proved something. In fact, the central limit theorem only holds under some conditions, and when these conditions are violated it can be the case that the mean of the sample means does NOT tend to the population mean.

  2. One of the uses of statistics in science is to compare and rank models, whereas statisticians tend to be more or equally concerned with making predictions about future measurements.

    The standard deviation (SD) allows you to make predictions for future measurements while the mean absolute deviation (MAD) doesn't (as far as I know).

    For example, if you make an additional measurement there is a 68% chance that it will be within 1 SD of the mean that you have already measured (assuming a normal distribution). I don't think you can make a similar statement based on a MAD.

    Similarly, if you make 10 new measurements, the central limit theorem can be used to predict how close the average of these measurements will be to the average you have already measured (on a larger sample).

    I found this video very instructive: https://www.khanacademy.org/math/probability/statistics-inferential/sampling_distribution/v/central-limit-theorem

  3. A couple of other add-ons. Variance is also really handy for multidimensional data: when you do regressions or principal component analysis, it is all done through variances and covariances, hence the need for the squaring. When it comes to standard deviation, in some sense you are getting the absolute value, because the technical definition of absolute value that I have seen is the square root of the number squared. That seems rather strange, but it is a neat way to define an algorithm to get the absolute value.

  4. Here's a whole paper on the question of standard vs mean deviation: http://www.leeds.ac.uk/educol/documents/00003759.htm

    Money quote:

    "Given that both SD and MD do the same job, MD's relative simplicity of meaning is perhaps the most important reason for henceforth using and teaching the mean deviation rather than the more complex and less meaningful standard deviation. Most researchers wishing to provide a summary statistic of the dispersion in their findings do not want to manipulate anything, whether algebraically or otherwise. For these, and for most consumers of research evidence, using the mean deviation is more democratic."
