
Multiple Regression: Basic Statistics Lecture Series Lecture #14

As promised last time, I am going to cover multiple regression analysis this time.  As mentioned last time, correlation does not imply causation, but causation does imply correlation, so establishing a correlation is a necessary (though insufficient) first step in determining causation.  Since this is a basic statistics lecture series, I assume students have not seen matrix algebra, so this section will only work with the solutions obtained through a program such as Excel, R, SPSS, or MatLab.

Multiple regression is the case where there is more than one independent variable for a single dependent variable.  For example, I mentioned last time that there is a causal correlation between the number of wins an MLB team has and the ratio of the runs that team scores to the runs it allows.  The more runs a team scores per run it allows, the more wins it is likely to have.  After all, the definition of a win in any given game is scoring more runs than you allow.

In the basic statistics course, you will be given a few tables to interpret.  When performing regression analysis, there are three sets of relevant information: the Regression Statistics, the ANOVA table (which I'll cover next time), and the Coefficients.

For the 2017 MLB season, the regression analysis for wins vs. runs scored and runs allowed looks like this:

Regression Statistics
Multiple R 0.997363151
R-Square 0.994733255
Adjusted R-Square 0.958830871
Standard Error 6.143992455
Observations 30

For simple regression, we had the coefficient of determination R2, which measures how well the two variables are correlated.  Multiple regression has the same concept: it shows how well the dependent variable correlates with all of the proposed independent variables together.  In this case, though, the coefficient needs an adjustment.  This stems from the possibility that one or more of the independent variables may not actually aid in the correlation.  For example, the number of strike-outs under the pitchers' belts in a season does not physically affect wins and shouldn't have any real correlation with wins, but if you include strike-outs in the multiple regression for wins, the multiple coefficient of determination is artificially inflated.

In order to calculate the adjusted multiple coefficient of determination by hand, we must know the regular multiple coefficient of determination, the sample size, and the number of independent variables.  The calculation for the adjusted coefficient is $R_{adj}^{2}=1-\frac{n-1}{n-(k+1)}*(1-R^{2})$, where n is the sample size (for wins by an MLB team, that is the number of teams in the MLB, or 30) and k is the number of independent variables in the multiple regression (here, runs scored and runs allowed, plus anything else you would want to add to the model).
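If you want to check that formula with software, here is a minimal sketch in Python; the function name and the example numbers are mine, purely for illustration.  Be aware that a program can report a somewhat different adjusted value when the y-intercept is forced to zero, as it is in the Excel output further down.

def adjusted_r_squared(r_squared, n, k):
    # R^2_adj = 1 - (n - 1) / (n - (k + 1)) * (1 - R^2)
    # n is the sample size, k is the number of independent variables.
    return 1 - (n - 1) / (n - (k + 1)) * (1 - r_squared)

# Example: 30 teams, 2 independent variables (runs scored and runs allowed).
print(adjusted_r_squared(r_squared=0.95, n=30, k=2))  # prints roughly 0.9463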

When you have a bunch of data for multiple regression, there is also a list of coefficients.  The one which corresponds to "Intercept" is the y-intercept term: the value of the dependent variable when all of the independent variables equal exactly 0.  The other coefficients tell you how the corresponding variable changes the dependent variable.  For example, here's the Excel output for the multiple regression of wins as a function of runs scored and runs allowed:

            Coefficients   Standard Error   t Stat      P-value    Lower 95%   Upper 95%
Intercept   0              #N/A             #N/A        #N/A       #N/A        #N/A
R/G         24.36261       1.48874          16.36458    7.26E-16   21.31306    27.41216
RA/G        -6.98305       1.485456         -4.70094    6.28E-05   -10.0259    -3.94023

Notice that the intercept is zero; after all, you won't get any wins if neither team scores.  If both teams somehow get no runs, the game just keeps going.  It may be "suspended" and resumed at a later time if the head ump deems it necessary, but if both teams fail to score, neither team logs a win.  In this case, the intercept also has a row full of "#N/A"s associated with it.  This is not normal; it is merely a relic of the fact that I intentionally anchored the y-intercept at zero (for the reason just explained), so no statistical analysis can be performed on that term.

How runs scored and runs allowed affect the overall number of wins can also be expressed as a multiple regression equation: games won as a function of runs scored per game ($R_{s}$, the R/G column) and runs allowed per game ($R_{a}$, the RA/G column), $W=m_{R_{s}}*R_{s}+m_{R_{a}}*R_{a}+b$.  This comes from the general equation of a line, y=mx+b, where m is the slope of the line and b is the y-intercept, the point where the line crosses the y-axis (or, to put it another way, the point where x=0).  In simple regression we had a single slope on the single ratio $\frac{R_{s}}{R_{a}}$; here, where runs scored and runs allowed sit on two sides of an addition sign instead of a division line, each m-value is that term's contribution to the total slope, or how strongly that variable affects the total number of wins.  The values for the m quantities come from the coefficient quantities above, so the equation for the multiple regression of wins as a function of runs scored and runs allowed in the 2017 MLB season is $W=24.36261*R_{s}-6.98305*R_{a}+0$.
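If you would rather let a computer find those m-values, here is a minimal sketch in Python using NumPy's least-squares solver.  The numbers are made up purely for illustration; the one important detail is that the design matrix has no column of ones, which is what anchors the y-intercept at zero just like the Excel output above.

import numpy as np

# Hypothetical per-team data: runs scored per game, runs allowed per game, wins.
rs = np.array([5.1, 4.4, 4.9, 3.9, 4.6])    # R/G
ra = np.array([4.2, 4.6, 4.0, 4.8, 4.5])    # RA/G
wins = np.array([93, 75, 90, 66, 80])

# No column of ones in the design matrix, so the intercept is anchored at zero.
X = np.column_stack([rs, ra])
coeffs, *_ = np.linalg.lstsq(X, wins, rcond=None)
m_rs, m_ra = coeffs
print(f"W = {m_rs:.3f} * Rs + {m_ra:.3f} * Ra")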

The Standard Error under Regression Statistics is the square root of the average squared vertical deviation of the data points from the regression line, $s_{E}=\sqrt{\frac{\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^2}{n-(k+1)}}$, where $\hat{Y}_{i}$ is the value the regression predicts for the i-th data point and the denominator, $n-(k+1)$, is the residual degrees of freedom (one fewer fitted parameter, so $n-k$, when the intercept is anchored at zero as it is here).  The standard error column in the coefficients table, by contrast, is the standard error of each individual coefficient: a measure of how precisely that slope has been estimated, and therefore how much uncertainty that variable adds to any future predictions.
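Continuing the Python sketch above, the standard error of the regression can be computed directly from the residuals (again, just an illustrative sketch; the degrees of freedom here are n minus the number of fitted slopes because no intercept was fit):

# Root of the average squared residual, using the residual degrees of freedom.
y_hat = X @ coeffs
residuals = wins - y_hat
dof = len(wins) - X.shape[1]      # no intercept was fit, so this is n - k
s_e = np.sqrt(np.sum(residuals**2) / dof)
print(f"standard error of the regression: {s_e:.3f}")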

The t Stat column is the calculated t-value, just as in hypothesis testing.  Its calculation is simply the coefficient divided by its standard error, $t_{Stat}=\frac{m_{variable}}{s_{E,variable}}$, where $s_{E,variable}$ is that variable's entry in the Standard Error column.  The p-value also comes from hypothesis testing: it is the probability of seeing a t-value at least this extreme if the variable actually contributed nothing to the model, and it is obtained by looking the t-value up in the t-table.  A tiny p-value, like the ones above, is strong evidence that the variable really does belong in the regression.
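Continuing the same sketch, each coefficient's standard error, t-statistic, and two-sided p-value can be computed as follows (this assumes SciPy is available; its t-distribution takes the place of the t-table):

from scipy import stats

# Coefficient standard errors from the unscaled covariance matrix (X^T X)^(-1).
cov_unscaled = np.linalg.inv(X.T @ X)
se_coef = s_e * np.sqrt(np.diag(cov_unscaled))
t_stat = coeffs / se_coef
p_value = 2 * stats.t.sf(np.abs(t_stat), df=dof)   # two-sided p-value
for name, t, p in zip(["R/G", "RA/G"], t_stat, p_value):
    print(f"{name}: t = {t:.3f}, p = {p:.3g}")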

The Lower 95% and Upper 95% columns are simply the 95% confidence interval for the particular variable or intercept in question: the coefficient estimate plus or minus its margin of error.  This means that, for any given variable, we can be 95% confident that the actual slope contribution of that variable lies somewhere between the lower-95% value and the upper-95% value.
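To round out the sketch, the 95% confidence interval for each slope is the estimate plus or minus the critical t-value times that slope's standard error:

# 95% confidence interval for each slope: estimate +/- t_critical * standard error.
t_crit = stats.t.ppf(0.975, df=dof)
lower = coeffs - t_crit * se_coef
upper = coeffs + t_crit * se_coef
for name, lo, hi in zip(["R/G", "RA/G"], lower, upper):
    print(f"{name}: 95% CI = ({lo:.3f}, {hi:.3f})")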

That's it for multiple regression.  If you have any questions, please leave them in the comments.  Next time, I'll cover the concept of ANOVA, which is short for ANalysis Of VAriance.  Until then, stay curious.

K. "Alan" Eister holds a Bachelor of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
