Basic Statistics Lecture #8: Anscombe's Quartet

"There are three kinds of lies: lies, damned lies, and statistics."

As promised last time, I will be covering Anscombe's Quartet.  It illustrates how summary statistics alone can misrepresent, or be used to lie about, a data set.  It is a series of four data sets constructed by statistician Francis Anscombe and published in the journal The American Statistician in 1973.

I'm going to provide you with four sets of data.  Do me a favor and apply what you know of statistical analysis to them.  If the resulting statistics look weird to you, don't be scared; you may have done the analysis perfectly.  Here are the four sets:

      I              II             III             IV
   x      y       x      y       x      y       x      y
  10    8.04     10    9.14     10    7.46      8    6.58
   8    6.95      8    8.14      8    6.77      8    5.76
  13    7.58     13    8.74     13   12.74      8    7.71
   9    8.81      9    8.77      9    7.11      8    8.84
  11    8.33     11    9.26     11    7.81      8    8.47
  14    9.96     14    8.10     14    8.84      8    7.04
   6    7.24      6    6.13      6    6.08      8    5.25
   4    4.26      4    3.10      4    5.39     19   12.50
  12   10.84     12    9.13     12    8.15      8    5.56
   7    4.82      7    7.26      7    6.42      8    7.91
   5    5.68      5    4.74      5    5.73      8    6.89

If you've calculated the statistics I've covered so far for these four separate sets, you'll notice that they are practically identical (they agree to about two decimal places).  This seems weird at first, especially since the y values differ across all four sets and the x values of the fourth set are all 8 except for a single 19.  There are other statistical values which I will cover a little later, namely the correlation coefficient, the coefficient of determination, and the linear regression equation, which are also the same for these data sets.
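If you would rather check this with a computer than by hand, here is a minimal Python sketch that computes those quantities for each set.  The names `quartet` and `summarize` are my own choices for illustration, and the standard deviation used is the sample standard deviation (dividing by n - 1), which is an assumption about which version you have been using.

```python
from statistics import mean, stdev

# The x values shared by sets I, II, and III.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

# Each entry maps a set name to its (x values, y values), copied from the table.
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the basic statistics plus the least-squares line for one data set."""
    mx, my = mean(x), mean(y)
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5  # correlation coefficient
    return {
        "mean_x": mx,
        "mean_y": my,
        "sd_x": stdev(x),  # sample standard deviation
        "sd_y": stdev(y),
        "r": r,
        "r_squared": r ** 2,  # coefficient of determination
        "slope": slope,
        "intercept": my - slope * mx,
    }


results = {name: summarize(x, y) for name, (x, y) in quartet.items()}

for name, stats in results.items():
    rounded = {key: round(value, 2) for key, value in stats.items()}
    print(name, rounded)
```

Rounding to two decimal places is deliberate: the four sets match at that precision, while the unrounded values differ slightly further out.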

That's the point Anscombe was trying to make with these four data sets; even though multiple data sets may have the same statistical quantities, they are not equal by any stretch of the imagination.  To emphasize this point, he paired the data sets with corresponding graphs:

All of these graphs have the same mean, standard deviation, and other statistical quantities.

These graphs immediately show what the problems are.  The second set follows a curve rather than a straight line.  In the last two, we need to remove the extreme values which obviously don't fit the rest of the data, then recalculate.  That will be shown two posts after this one.  This is a visual representation of how statistics can be used as a lie.  There are other collections of data sets which show this concept, but this is the first and most famous to make the point.
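As a rough preview of the "remove and recalculate" step, here is a small Python sketch, not the full treatment promised for the later post.  It assumes the extreme value in the third set is the point (13, 12.74) and the one in the fourth set is (19, 12.5), both picked by eye; `fit_line` is a helper name I made up for this example.

```python
def fit_line(points):
    """Least-squares fit; returns (slope, intercept, r), or None if x never varies."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    if sxx == 0:
        # Every x is the same, so no slope can be computed.
        return None
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, my - slope * mx, r


set_iii = [(10, 7.46), (8, 6.77), (13, 12.74), (9, 7.11), (11, 7.81), (14, 8.84),
           (6, 6.08), (4, 5.39), (12, 8.15), (7, 6.42), (5, 5.73)]
set_iv = [(8, 6.58), (8, 5.76), (8, 7.71), (8, 8.84), (8, 8.47), (8, 7.04),
          (8, 5.25), (19, 12.50), (8, 5.56), (8, 7.91), (8, 6.89)]

# Assumption: these are the points judged "extreme" from the graphs.
iii_trimmed = [p for p in set_iii if p != (13, 12.74)]
iv_trimmed = [p for p in set_iv if p != (19, 12.50)]

print("III, all points:    ", fit_line(set_iii))
print("III, without (13, 12.74):", fit_line(iii_trimmed))
print("IV, without (19, 12.5):  ", fit_line(iv_trimmed))
```

With the extreme point gone, the third set falls almost exactly on a line with a slope of roughly 0.345 instead of 0.5.  The fourth set behaves differently: once (19, 12.5) is removed every remaining x equals 8, so there is no line to fit at all and the sketch returns `None`.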

Don't get me wrong; statistics is a good tool to describe a series of data points if done properly; the problem arises when statistical methods are applied either poorly or with the intention of deception. With that in mind, it would be a good idea to look at the graph of the data before taking the statistics to heart.  If you want to take a look at the original paper concerning this phenomenon, I'll link it here.

If you have any questions, please leave them in the comments. Next time, I'll begin the process of hypothesis testing. Until then, stay curious.

K. "Alan" Eister has a Bachelor of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
