
Regression Analysis: Basic Statistics Lecture Series Lecture #13

As promised last time, I am going to cover the basics of simple regression analysis.  It is called simple because there is only one independent variable predicting the dependent variable.

As a side note here, most of you are likely familiar with the phrase "correlation does not imply causation".  This statement, while true, is misleading.  Yes, it is true that not all correlations are causal, but it is also true that all causal relationships are correlated.  Causation cannot occur without correlation, so correlation is a necessary first step toward showing causation, but it is also an insufficient step.

For example, if we look at the graph of MLB wins on the y-axis and the ratio of runs scored to runs allowed on the x-axis, it is easily apparent that the ratio and the number of wins are highly correlated.  This is a causal relationship as well.  The more runs you score and the fewer runs you give up, the more you'll win, because the winner of any given game is the team that scores more runs than it gives up.  This is a case in point showing that correlation is necessary to show causation.

[Figure: the raw fit, whose intercept implies wins with no runs scored; original caption: "Hey look!  I can win without scoring!"]

Of course, the intercept has to make sense.  The corrected graph of the above would look like this:

[Figure: the corrected fit; original caption: "Lower coefficient of correlation, but makes more sense."]
But it is insufficient, and here's a case in point to show that concept.  There are many things which have risen alongside one another purely by coincidence.  Two of these things are the divorce rate in the U.S. state of Maine and the per capita consumption of margarine in the whole of the United States.  While it is easy to make a joke about men who can't cook, nobody in their right mind would actually say that this relationship is causal, because it obviously is not.
[Figure: Maine divorce rate plotted against U.S. per capita margarine consumption; original caption: "I eat margarine because Joe and Susan in Portland, Maine got divorced!"]
So yes, while correlation is a necessary first step in showing causality, it is still insufficient.  This post covers how to establish that first step.  The mechanism for showing causality from correlation depends on the particular field you are looking at.  For instance, Newton showed that falling is caused by gravity by dropping balls in an exceedingly scientific way, but if you wanted to show that people falling on their heads causes concussions, you would not be dropping people on their heads.  Well, I hope not, anyways.

Now that the tangent is over, let's get into simple linear regression.

In the previous lectures, I've been dealing with a single continuous variable.  With simple regression, we deal with two continuous variables which interact and ask "What is the association between the two variables?"  We need three things in order to develop a regression analysis:
  1. The response variable y
  2. The predictor variable x
  3. A sample of paired x and y values to serve as the sample data

The sample data will always come in pairs. Once these are obtained, we plot the sample points on an x-y graph.  This will yield a positive linear association when the linear slope is positive and a negative linear association when the slope is negative. When there appears to be no linearity or curve, there is no association.
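
As a rough illustration, here is how a paired sample might be plotted and eyeballed for association.  This is a minimal sketch with invented numbers (loosely modeled on the wins vs. run-ratio example above), not real MLB data.

```python
# Minimal sketch: plot a paired sample and eyeball the association.
# The numbers are invented for illustration, not real data.
import matplotlib.pyplot as plt

x = [0.85, 0.92, 1.00, 1.08, 1.15, 1.25]   # predictor: runs scored / runs allowed
y = [70, 76, 81, 88, 93, 99]               # response: wins (paired with x)

plt.scatter(x, y)
plt.xlabel("runs scored / runs allowed")
plt.ylabel("wins")
plt.title("Points trend upward, so the association looks positive and linear")
plt.show()
```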

The goal of simple linear regression is to find the best line that describes the sample using the least squares calculation.  To put it another way, to find the line y=mx+b which minimizes the sum of the squared vertical distances between the line and the data points (hence "least squares").  To find this line, perform the following steps:
  1. Develop a chart with x-values, their corresponding y-values, x-squared, y-squared, x*y, the x-point minus the mean of the x's, and the y-point minus the mean of the y's.
  2. After the chart is filled with the appropriate values, find the sum of each column.  (Add together all of the values of each column.)
  3. Perform the calculation for the y-intercept b by running the following formula: $$b=\frac{[(\sum{(y)})*(\sum{(x^2)})-(\sum (x))*(\sum (x*y))]}{[n*(\sum (x^2)) -(\sum (x))^2]}$$
  4. Perform the calculation for the slope m by running the following formula: $$m=\frac{[n*(\sum (x*y))-(\sum (x))*(\sum (y))]}{[n*(\sum (x^2))-(\sum (x))^2]}$$
Be careful with these calculations, though; the order of operations still applies.  Everything inside parentheses is done first, exponents second, multiplication and division third, and addition and subtraction last.  The only instance where an exponent is evaluated along with the parentheses is when the exponent sits inside them.  In the denominator, for example, the term $(\sum (x^2))$ means adding together all of the values in the x-squared column, while $(\sum (x))^2$ means adding up all of the values in the x column and then squaring the result.  The square brackets in the numerator and denominator of each equation work the same way as parentheses, so calculate the numerator in full first and the denominator in full second.  At the last step of the calculations, you should have a single number in the numerator and a single number in the denominator.
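
Here is a minimal sketch of those steps, using the same invented sample as above.  The variable names are my own, chosen to mirror the column sums in the formulas.

```python
# Sketch of the column-sum recipe above (invented data, not real).
# Note the order-of-operations point: sum_x2 (the x^2 column, summed) is a
# different number from sum_x**2 (the x column summed, then squared).

x = [0.85, 0.92, 1.00, 1.08, 1.15, 1.25]
y = [70, 76, 81, 88, 93, 99]
n = len(x)

sum_x  = sum(x)
sum_y  = sum(y)
sum_x2 = sum(xi**2 for xi in x)                  # sum of the x-squared column
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # sum of the x*y column

# y-intercept b and slope m, straight from the two formulas above
b = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x**2)
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)

print(f"fitted line: y = {m:.2f}x + {b:.2f}")
```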

There are also two closely related quantities: the coefficient of correlation, denoted r, and the coefficient of determination, denoted R², which measure precisely how correlated the two variables are with one another.  The quantities needed come from the same table you've created with x's, y's, x-squared's, y-squared's, and x*y's.  Simply run the following equation: $$r=\frac{n*\sum(xy)-(\sum(x))*(\sum(y))}{\sqrt{n*\sum(x^2)-(\sum(x))^2}*\sqrt{n*\sum(y^2)-(\sum(y))^2}}$$ and then square the result to get $R^2 = r^2$.  This allows you to see how closely the two variables correlate with one another.  The value of r will always be somewhere between -1 and +1.  The closer to zero the value is, the less correlated the two variables are.  The further away from zero the value is (the closer to $\pm 1$), the more correlated the two variables are.  An r = -1 means perfect negative correlation (one goes up as the other goes down; money spent vs. money in your possession is a good example of negative correlation), while an r = +1 is perfect positive correlation (wins vs. runs scored per run allowed is a good example of this).  Because it is a square, R² always falls between 0 and +1, and the closer it is to 1 the better the line fits the data.
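
A quick sketch of that calculation, again on the invented sample from earlier:

```python
# Sketch: correlation coefficient r and coefficient of determination R^2
# from the same column sums (invented data, not real).
import math

x = [0.85, 0.92, 1.00, 1.08, 1.15, 1.25]
y = [70, 76, 81, 88, 93, 99]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi**2 for xi in x)
sum_y2 = sum(yi**2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

r = (n * sum_xy - sum_x * sum_y) / (
    math.sqrt(n * sum_x2 - sum_x**2) * math.sqrt(n * sum_y2 - sum_y**2))
r_squared = r**2

print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")   # r near +1 here: strong positive correlation
```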

That's it for simple regression analysis.  If you have any questions, please leave them in the comments.  Next time, I'll be covering multiple regression.

K. "Alan" Eister has his bacholers of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
