Posts

Introduction to The Physical Chemistry I Lecture Series

Hello everyone, and welcome to The Physical Chemistry I Lecture Series from The Science of Life. The first semester of Physical Chemistry deals with the thermodynamics of all things chemistry. Most of the chemistry in this course is inorganic in nature, but the laws can be applied to organic chemistry as well. I will be referencing the book "Physical Chemistry 8th Edition" by Peter Atkins and Julio de Paula (ISBN: 9780198700722) for those of you who want to follow along. This book is good for three semesters' worth of physical chemistry study: Section 1 - Equilibrium (Semester 1, Thermodynamics); Section 2 - Structure (Semester 2, Intro to Quantum Chemistry); and Section 3 - Change (Semester 3, Statistical Chemical Thermodynamics). This Lecture Series will deal with the first semester, covered by Section 1 in the book. Asking questions is greatly encouraged, so if you have any questions, please leave them in the comments. Make sure to write statements ...
Recent posts

ANOVA And Multiple Data Sets: Basic Statistics Lecture Series Lecture #15

In this last lecture of the Basic Statistics Lecture Series, as promised last time, I will cover ANOVA, or ANalysis Of VAriance. This is a process for comparing 3 or more distinct samples. In the MLB, this means comparing the 6 different divisions of the MLB, or the three divisions within each league (National League or American League). The point of ANOVA is to test whether there is at least one inequality among three or more means. This means that there can be as many groups as the experiment you're running requires, and the null hypothesis is always that every group mean is statistically equivalent to every other group mean. If there is even one inequality among the groups, then the null hypothesis is rejected. As mentioned last time, when we run multiple regression analysis on some data, there is a table called ANOVA. For the baseball data I've been using throughout this entire lecture series, that table looks li...
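A minimal sketch of the idea, assuming scipy is available; the division win totals below are invented placeholders, not the lecture's actual MLB data:

```python
from scipy import stats

# Hypothetical season win totals for three MLB divisions (illustration only).
al_east    = [93, 91, 85, 78, 68]
al_central = [92, 86, 83, 74, 64]
al_west    = [98, 89, 85, 75, 70]

# Null hypothesis: all three group means are equal.
# A small p-value means at least one mean differs from the others.
f_stat, p_value = stats.f_oneway(al_east, al_central, al_west)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

If the p-value falls below the chosen significance level, we reject the null hypothesis that every group mean is equivalent; the test alone does not say which group differs.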

Multiple Regression: Basic Statistics Lecture Series Lecture #14

As promised last time, I am going to cover multiple regression analysis this time. As mentioned last time, while correlation does not imply causation, causation does imply correlation, so correlation is a necessary but insufficient first step in determining causation. Since this is a basic statistics lecture series, there is an assumption that matrix algebra is not known to students who take this course, so this section will only be working with the solutions obtained through a program such as Excel, R, SPSS, or MatLab. This is the regression case where there is more than one independent variable for a single dependent variable. For example, I mentioned last time that there is a causal correlation between the number of wins a team in the MLB has and the ratio of the runs that team scored to the runs that team allowed. The more runs they scored per run they allowed, the more wins they are likely to hav...
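A minimal sketch of multiple regression via ordinary least squares, assuming numpy; the team statistics are made-up placeholders, not the lecture's real baseball data:

```python
import numpy as np

# Two independent variables (runs scored, runs allowed), one dependent (wins).
runs_scored  = np.array([820, 750, 690, 730, 610], dtype=float)
runs_allowed = np.array([640, 700, 720, 680, 790], dtype=float)
wins         = np.array([97, 84, 76, 85, 62], dtype=float)

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones_like(wins), runs_scored, runs_allowed])

# Solve for the coefficients that minimize the squared error.
coef, residuals, rank, _ = np.linalg.lstsq(X, wins, rcond=None)
intercept, b_scored, b_allowed = coef
print(f"wins ~ {intercept:.2f} + {b_scored:.4f}*scored {b_allowed:+.4f}*allowed")
```

This mirrors what Excel, R, SPSS, or MatLab do under the hood, so the course's "read the program's output" approach and this sketch should agree on the coefficients.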

Regression Analysis: Basic Statistics Lecture Series Lecture #13

As promised last time, I am going to cover the basics of simple regression analysis. It is called simple because there is only one independent variable for the dependent variable. As a side note, most of you are likely familiar with the phrase "correlation does not imply causation". This statement, while true, is misleading. While it is true that not all correlations are causal, it is also true that all causal relationships are correlated. Causation cannot occur without correlation, so correlation is a necessary first step in showing causation, but it is also an insufficient one. For example, if we look at the graph of MLB wins on the y-axis and the ratio of runs scored to runs allowed on the x-axis, it is easily apparent that the ratio and the number of wins are highly correlated. This is a causal relationship as well. The more runs you score and the fewer runs you give up, the more you'll win, because the winner is the team who scores more runs than...
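A minimal sketch of a simple (one independent variable) regression fit, assuming scipy; the run-ratio and win values are illustrative, not the post's MLB data:

```python
from scipy import stats

# x: ratio of runs scored to runs allowed; y: season wins.
run_ratio = [1.28, 1.07, 0.96, 1.07, 0.77]
wins      = [97, 84, 76, 85, 62]

# Fit y = intercept + slope * x; rvalue measures correlation strength.
result = stats.linregress(run_ratio, wins)
print(f"slope = {result.slope:.1f}, intercept = {result.intercept:.1f}")
print(f"r = {result.rvalue:.3f}, p = {result.pvalue:.4f}")
```

An r value near 1 (or -1) indicates strong correlation, which, as the post notes, is necessary but not sufficient evidence for causation.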

Types of Statistical Error: Basic Statistics Lecture Series Lecture #12

As promised last time, I will cover types of statistical error this time. Knowing the magnitude and the type of error is important to convey with any hypothesis test. This also happens to be why, in science, it is said that nothing can ever truly be proven; only disproven. First, it is important to understand that error typing is an integral part of hypothesis testing and of no other part of statistics, much like the human brain and the person it belongs to: the brain cannot fit into any other species, and a human cannot live without it. The same concept applies to these types of error and hypothesis testing; they fit nowhere else, and they are necessary for the success of hypothesis testing. So what specifically is statistical error? It is the chance that the conclusion is incorrect, namely the chance that the null hypothesis is rejected when it's true (Type I Error, false positive) and the chance of failing to reject the null hypothesis when it'...
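A minimal sketch simulating the Type I error rate, assuming numpy and scipy: when the null hypothesis really is true, a test at significance level alpha should reject it (a false positive) about alpha of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, rejections, trials = 0.05, 0, 10_000

for _ in range(trials):
    # Draw a sample where the null hypothesis (mean = 0) is actually true.
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        rejections += 1  # Type I error: rejecting a true null.

print(f"Observed Type I error rate: {rejections / trials:.3f} (should be near {alpha})")
```

Type II error (failing to reject a false null) could be simulated the same way by drawing samples from a distribution whose true mean differs from the null value.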

Confidence Interval: Basic Statistics Lecture Series Lecture #11

You'll remember last time, I covered hypothesis testing of proportions, and the time before that, hypothesis testing of a sample with a mean and standard deviation. This time, I'll cover the concept of confidence intervals. Confidence intervals are of the form $\mu \in (a, b)_{1-\alpha}$, where a and b are two numbers such that a<b, α is the significance level as covered in hypothesis testing, and μ is the actual population mean (not the sample mean). This is the statement that there is a [(1-α)*100]% probability that the true population mean will be somewhere between a and b. The obvious question is "How do we find a and b?". Here, I will describe the process. Step 1. Find the Fundamental Statistics The first thing we need to do is find the fundamental statistics: the mean, standard deviation, and sample size. The sample mean is typically referred to as the point estimate by most statistics textbooks. This is because the point estimate of the po...
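A minimal sketch of computing a (1-α) confidence interval for the population mean with the t distribution, assuming numpy and scipy; the sample values are placeholders:

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.3, 4.2, 3.7])
alpha = 0.05  # gives a 95% confidence interval

mean = sample.mean()                              # point estimate of mu
sem = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)

a, b = mean - t_crit * sem, mean + t_crit * sem
print(f"mu in ({a:.3f}, {b:.3f}) with {(1 - alpha) * 100:.0f}% confidence")
```

The endpoints a and b are the point estimate plus or minus the critical value times the standard error, which is the process the post walks through step by step.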

Basics of Statistics Lecture #10: Hypothesis Testing of Proportions

As promised last time, I am going to cover hypothesis testing of proportions. This is conceptually similar to hypothesis testing from the mean and standard deviation, but the calculations are going to be different. A proportion is the percentage of successes in a sample or population, reported in decimal form. This means that if heads comes up 50% of the time, then it is reported as p=0.50. Because of this, the calculations are different from those of the mean and standard deviation case from last time. For instance, when we have proportion data, we don't know the standard deviation of either the sample or the population. This means that the standard error calculation cannot be performed as usual. We must instead use a proportion-specific version, which is given by $\sigma_{E}=\sqrt{\frac{p(1-p)}{n}}$, where p is the proportion and n is the sample size. If we have the sample size and the number of successes, we can calculate the proporti...
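A minimal sketch of a one-sample proportion test built on the standard error formula above, assuming scipy; the coin-flip counts are invented for illustration:

```python
import math
from scipy import stats

successes, n = 58, 100   # e.g. 58 heads in 100 flips
p_null = 0.50            # null hypothesis: a fair coin

p_hat = successes / n    # sample proportion from count data
# sigma_E = sqrt(p * (1 - p) / n), evaluated under the null proportion.
se = math.sqrt(p_null * (1 - p_null) / n)

z = (p_hat - p_null) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-tailed test
print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, p = {p_value:.4f}")
```

Note how the sample proportion itself comes straight from the number of successes divided by the sample size, which is all the formula needs.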