
ANOVA And Multiple Data Sets: Basic Statistics Lecture Series Lecture #14

In this last lecture of the Basic Statistics Lecture Series, as promised last time, I will cover ANOVA, or ANalysis Of VAriance.  This is a process for comparing 3 or more distinct samples.  In MLB terms, this could mean comparing the 6 divisions of the MLB, or the three divisions within each league (National League or American League).

The point of ANOVA is to test whether there is at least one inequality among three or more means.  There can be as many groups as the experiment you're running requires, and the null hypothesis is always that every group mean is statistically equivalent to every other group mean.  If even one pair of group means differs, the null hypothesis is rejected.
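As a quick illustration, here is how a one-way ANOVA of three groups might be run in Python.  The run totals below are invented for the sketch (not real MLB numbers), and `scipy.stats.f_oneway` is just one common way to perform the test:

```python
# One-way ANOVA on three hypothetical groups (e.g., season run totals
# from three divisions). The numbers are made up for illustration.
from scipy.stats import f_oneway

east = [702, 688, 715, 691, 670]
central = [655, 640, 668, 649, 661]
west = [698, 705, 680, 692, 710]

f_stat, p_value = f_oneway(east, central, west)

# A small p-value means at least one group mean differs from the others.
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Here the central group's totals sit well below the other two, so the test rejects the null that all three means are equal.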

As mentioned last time, when we run multiple regression analysis on some data, there is a table called ANOVA.  For the baseball data I've been using throughout this entire lecture series, that table looks like this:

ANOVA

              df    SS             MS             F              Significance F
  Between     2     199629.038     99814.51899    2644.188249    1.06832E-31
  Within      28    1056.962012    37.74864329
  Total       30    200686

The second column is the degrees of freedom for whichever source of variation we're talking about: for the "Between" row it's the number of variables in the regression, and for the "Within" row it's the residual degrees of freedom, the total number of data points minus the number of variables, minus one.

The third column is the Sum of Squares, where we take the difference between a data point and the mean, square it, perform that calculation for all data points, and add together (take the sum of) the results.
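That calculation is short enough to sketch directly.  The data list here is a made-up toy example, not from the baseball set:

```python
# A minimal sketch of the Sum of Squares: subtract the mean from each
# data point, square the difference, and add the results up.
data = [4.0, 7.0, 6.0, 3.0]
mean = sum(data) / len(data)            # mean = 5.0
ss = sum((x - mean) ** 2 for x in data)

# deviations: -1, 2, 1, -2  ->  squares: 1, 4, 1, 4  ->  sum: 10
print(ss)  # -> 10.0
```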

The fourth column is the mean square column, a variance-like estimate obtained by taking the sum of squares value and dividing it by its corresponding degrees of freedom.  Notice that only two MS values are reported; the total MS is not used in ANOVA testing.
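Using the numbers from the table above, the MS calculation is just a division:

```python
# Mean Square = Sum of Squares / its degrees of freedom,
# using the Between and Within rows from the table above.
ss_between, df_between = 199629.038, 2
ss_within, df_within = 1056.962012, 28

ms_between = ss_between / df_between   # ~99814.519
ms_within = ss_within / df_within      # ~37.7486
```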

The F value is the test statistic in ANOVA.  It plays the same role as the t-statistic when testing the difference between two means: it is the value you compare to the critical value from the table.  The test statistic is calculated by taking the MS value for the between group and dividing it by the MS value for the within group.

The Significance F is the p-value of the F statistic, equivalent to obtaining the p-value from a t-statistic.  It is the value you compare to your chosen alpha to determine whether or not to reject the null hypothesis.
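Putting the last two columns together, here is a sketch of recovering the F statistic and its p-value from the MS values in the table, assuming `scipy` is available:

```python
# F = MS(between) / MS(within); Significance F is the upper-tail
# probability of that F under an F distribution with (2, 28) df.
from scipy.stats import f

ms_between, ms_within = 99814.51899, 37.74864329
df_between, df_within = 2, 28

f_stat = ms_between / ms_within                # ~2644.19, as in the table
p_value = f.sf(f_stat, df_between, df_within)  # upper-tail probability

# p_value comes out astronomically small, far below any conventional
# alpha, so the null hypothesis is rejected.
```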

So in this case, even for an alpha of less than half a percent, the null hypothesis is rejected.  Here, that means there is a real relationship involving runs scored and runs allowed.  The same F test can also be used to judge whether a particular multiple regression analysis is statistically meaningful: the null hypothesis is that the predictors add nothing beyond the intercept alone.  Rejecting it here means the regression analysis -- the prediction that the number of wins is based on runs allowed and runs scored alone -- is statistically significant.

That's the basics of statistical ANOVA testing.  If you have any questions, please leave them in the comments.  This is the end of the Basic Statistics Lecture Series, so I hope you use this information well.  Stay curious, my friends.

K. "Alan" Eister has his Bachelor of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
