Posts

Showing posts from 2017

Multiple Regression: Basic Statistics Lecture Series Lecture #14

As promised from last time, I am going to cover multiple regression analysis this time.  As mentioned last time, while correlation may not imply causation, causation does imply correlation, so correlation is a necessary but insufficient first step in determining causation.  Since this is a basic statistics lecture series, there is an assumption that students taking this course do not know matrix algebra, so this section will only work with the solutions obtained through a program such as Excel, R, SPSS, or MatLab.  This is the regression case where there is more than one independent variable, or multiple independent variables, for a single dependent variable.  For example, I mentioned last time that there is a causal correlation between the number of wins a team in the MLB has and the ratio of the runs that team scored to the runs that team allowed.  The more runs they scored per run they allowed, the more wins they are likely to have.  After all, the
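Since the post leans on software rather than matrix algebra, here is a minimal sketch of how a multiple regression fit can be obtained with NumPy's least-squares solver; the two predictors and the response values are invented purely for illustration and are not data from the post.

```python
# Minimal multiple regression sketch (ordinary least squares) using NumPy.
# The predictor and response values below are made up for illustration only.
import numpy as np

# Two independent variables (x1, x2) and one dependent variable (y).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y  = np.array([3.1, 4.9, 9.2, 10.8, 15.1])

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve for the coefficients that minimize the squared error.
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coef
print(f"y = {intercept:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")
```

A statistics package such as R or SPSS reports the same coefficients along with standard errors and p-values; the sketch above shows only the fitted equation.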

Regression Analysis: Basic Statistics Lecture Series Lecture #13

As promised last time, I am going to cover the basics of simple regression analysis.  It is simple because there is only one independent variable for the dependent variable. As a side note, most of you are likely familiar with the phrase "correlation does not imply causation".  This statement, while true, is misleading.  Yes, it is true that not all correlations are causal, but it is also true that all causal relationships are correlated.  Causation cannot occur without correlation, so correlation is a necessary first step in showing causation, but it is also an insufficient step. For example, if we look at the graph of MLB wins on the y-axis and the ratio of runs scored to runs allowed on the x-axis, it is easily apparent that the ratio and the number of wins are highly correlated.  This is a causal relationship as well.  The more runs you score and the fewer runs you give up, the more you'll win, because the winner is the team who scores more runs than they give up.
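As a companion to the MLB example, here is a minimal sketch of a simple (one independent variable) regression fit; the run-ratio and win totals below are invented numbers, not real MLB data.

```python
# Minimal simple regression sketch: one independent variable (run ratio)
# predicting one dependent variable (wins). The numbers are invented for
# illustration and are not real MLB data.
import numpy as np

run_ratio = np.array([0.85, 0.95, 1.00, 1.10, 1.25])  # runs scored / runs allowed
wins      = np.array([68,   75,   81,   89,   98])

# Fit wins = slope * run_ratio + intercept by least squares.
slope, intercept = np.polyfit(run_ratio, wins, 1)
r = np.corrcoef(run_ratio, wins)[0, 1]

print(f"wins = {slope:.1f} * run_ratio + {intercept:.1f}  (correlation r = {r:.3f})")
```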

Types of Statistical Error: Basic Statistics Lecture Series Lecture #12

As promised last time, I will cover types of statistical error this time.  Knowing the magnitude and the type of error is important to convey with any hypothesis test.  This also happens to be why, in science, it is said that nothing can ever truly be proven, only disproven. First, it is important to understand that error typing is an integral part of hypothesis testing and of no other part of statistics, similar to the human brain and the person it's in: the brain cannot fit into any other species, and humans cannot live without it.  The same concept applies to these types of errors and hypothesis testing; they fit nowhere else, and they are necessary for hypothesis testing to succeed. So what specifically is statistical error?  It is the chance that the conclusion of the test is incorrect, namely the chance that the null hypothesis is rejected when it's true (Type I error, a false positive) and the chance of failing to reject the null hypothesis when it's false (Type II error, a false negative).
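To make the two error types concrete, here is a small simulation sketch that estimates Type I and Type II error rates for a one-sided test of a mean; the effect size, sample size, and significance level are arbitrary choices for illustration, not values from the post.

```python
# Simulation sketch of Type I and Type II error rates for a one-sided z-test
# of H0: mean = 0 against H1: mean > 0. All parameters are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000
z_crit = norm.ppf(1 - alpha)          # one-sided rejection threshold

def reject(true_mean):
    """Run one test and report whether H0 is rejected."""
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    z = sample.mean() / (sample.std(ddof=1) / np.sqrt(n))
    return z > z_crit

type1 = np.mean([reject(0.0) for _ in range(trials)])       # H0 true, rejected
type2 = np.mean([not reject(0.5) for _ in range(trials)])   # H0 false, not rejected

print(f"Estimated Type I rate:  {type1:.3f} (should be near alpha = {alpha})")
print(f"Estimated Type II rate: {type2:.3f}")
```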

Confidence Interval: Basic Statistics Lecture Series Lecture #11

You'll remember last time, I covered hypothesis testing of proportions, and the time before that, hypothesis testing of a sample with a mean and standard deviation.  This time, I'll cover the concept of confidence intervals. Confidence intervals are of the form $\mu \in (a, b)_{1-\alpha}$, where a and b are two numbers such that a < b, α is the significance level as covered in hypothesis testing, and μ is the actual population mean (not the sample mean). This is the statement that there is a [(1-α)*100]% probability that the true population mean will be somewhere between a and b.  The obvious question is "How do we find a and b?".  Here, I will describe the process. Step 1: Find the fundamental statistics. The first thing we need to do is find the fundamental statistics: the mean, standard deviation, and the sample size.  The sample mean is typically referred to as the point estimate by most statistics textbooks.  This is because the point estimate of the population mean is the sample mean itself.
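Here is a minimal sketch of the procedure the post describes, computing a (1-α) confidence interval for the population mean from the fundamental statistics; the sample values and α = 0.05 are assumptions for illustration only.

```python
# Minimal confidence-interval sketch for the population mean using the
# t distribution (population standard deviation unknown). Data are invented.
import numpy as np
from scipy.stats import t

sample = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 5.2])
alpha = 0.05

n = sample.size
mean = sample.mean()                      # point estimate of the population mean
sd = sample.std(ddof=1)                   # sample standard deviation
std_err = sd / np.sqrt(n)                 # standard error of the mean

t_crit = t.ppf(1 - alpha / 2, df=n - 1)   # two-sided critical value
a, b = mean - t_crit * std_err, mean + t_crit * std_err
print(f"{(1 - alpha) * 100:.0f}% confidence interval: ({a:.3f}, {b:.3f})")
```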

Basics of Statistics Lecture #10: Hypothesis Testing of Proportions

As promised last time, I am going to cover hypothesis testing of proportions.  This is conceptually similar to hypothesis testing with a mean and standard deviation, but the calculations are going to be different. A proportion is the percentage of successes in a sample or population, reported in decimal form.  This means that if heads comes up 50% of the time, then it is reported as p = 0.50.  Because of this, the calculations are different from those of the mean and standard deviation case from last time.  For instance, when we have proportion data, we don't know the standard deviation of either the sample or the population.  This means that the standard error calculation cannot be performed as usual.  We must instead use a proportion-specific version, which is given by $\sigma_{E}=\sqrt{\frac{p*(1-p)}{n}}$, where p is the proportion and n is the sample size.  If we have the sample size and the number of successes, we can calculate the proportion by $p=\frac{n_s}{n}$, where $n_s$ is the number of successes.
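Here is a minimal sketch of the proportion-specific standard error and a one-sample z-test built on it; the counts (52 heads in 100 flips) and the null proportion p0 = 0.5 are illustrative assumptions, not data from the post.

```python
# Proportion standard error and one-sample z-test sketch. Counts are invented.
import numpy as np
from scipy.stats import norm

n_success, n = 52, 100
p_hat = n_success / n                     # p = n_s / n
p0 = 0.50                                 # null-hypothesis proportion

std_err = np.sqrt(p0 * (1 - p0) / n)      # sigma_E = sqrt(p*(1-p)/n) under H0
z = (p_hat - p0) / std_err
p_value = 2 * (1 - norm.cdf(abs(z)))      # two-sided p-value

print(f"p-hat = {p_hat:.2f}, z = {z:.3f}, p-value = {p_value:.3f}")
```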

Basics of Statistics Lecture #9: Hypothesis Testing of Mean and Standard Deviation

As promised last time, I will introduce the concept of hypothesis testing.  This is finally getting to the meat and potatoes of statistical analysis. There are three conditions that need to hold for a hypothesis test to be valid: the data needs to be sampled randomly (a random sample is exactly what it says on the tin: a group from the population where each individual has an equal chance of being picked); we need to know the sample mean and standard deviation; and one or both of the following: the data comes from a normal distribution, or there is a large sample size (for a lot of cases, at least 30, but the more the merrier). There are 5 steps for hypothesis testing, using either of two methods.  Statisticians typically distinguish between whether or not we know the population standard deviation.  Realistically, this means whether we know the standard deviation for the population (everyone) or only for the sample (the small portion of everyone under study).  For
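As a concrete illustration of the case where only the sample standard deviation is known, here is a minimal one-sample t-test sketch; the sample values and the null mean are invented for illustration.

```python
# One-sample test of a mean when the population standard deviation is unknown:
# use the sample standard deviation and the t distribution. Data are invented.
import numpy as np
from scipy.stats import ttest_1samp

sample = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.6, 10.0])
null_mean = 10.0

t_stat, p_value = ttest_1samp(sample, popmean=null_mean)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")

# Reject the null hypothesis at significance level alpha if p_value < alpha.
alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```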

Basic Statistics Lecture #0: Basic Terms and Calculations

This is an entry in the Basic Statistics Lecture Series which I realized I had failed to incorporate into the series so far, and I apologize for that.  This lecture is going to cover the basic calculations required for all of statistical analysis, on a fundamental level. When we collect statistical data, we typically collect it on a small part of the population as a whole.  The population is the entire group for which a statement is meant to hold true.  This small part of the population as a whole is called the sample of the population, or just the sample for short.  The number of data points we collect -- the number of items in the sample -- is called the sample size, which is typically denoted by the letter n. In statistics, we typically calculate the arithmetic mean, which is the sum of all values divided by the sample size.  For the population as a whole, this is denoted by the Greek letter μ (mu), and for a sample, it's denoted by $\overline{x}$ (x-bar).  The sample mean is expressed by $\overline{x}=\frac{\sum_{i=1}^{n} x_i}{n}$
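For completeness, here is a minimal sketch of these fundamental calculations on a small made-up sample; the data values are invented for illustration only.

```python
# Fundamental calculations sketch: sample size, sample mean, sample standard
# deviation. The data values are invented for illustration.
import numpy as np

sample = np.array([3.0, 5.0, 4.0, 6.0, 7.0, 5.0])

n = sample.size                      # sample size, n
x_bar = sample.sum() / n             # arithmetic mean: sum of values / n
s = sample.std(ddof=1)               # sample standard deviation (n - 1 in the denominator)

print(f"n = {n}, mean = {x_bar:.3f}, sample sd = {s:.3f}")
```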

Basic Statistics Lecture #8: Anscombe's Quartet

"There are three kinds of lies: lies, damned lies, and statistics." - Unknown As promised last time , I will be covering Anscombe's Quartet.  It is an idea where people may use statistics to lie about a data set.  It is a series of data sets developed by statistician Francis Anscombe and published in the journal American Statistician in 1973. I'm going to provide you with four sets of data.  Do me a favor and apply what you know of statistical analysis to them.  If the statistical data look weird to you, don't be scared; you may have done the analysis perfectly.  Here's the four sets: I II III IV x y x y x y x y 10 8.04 10 9.14 10 7.46 8 6.58 8 6.95 8 8.14 8 6.77 8 5.76 13 7.58 13 8.74 13 12.74 8 7.71 9 8.81 9 8.77 9 7.11 8 8.84 11 8.33 11 9.26 11 7.81

Basic Statistics Lecture #7: Quantitative, Continuous, and Numerical Data

As promised last time, today I will cover basic calculations on real, accumulated data.  Please take note that all of the following is for the simple case of one group of data.  For two or more distinct groups of data, the calculations will be similar, but slightly more involved due to the nature of 2+ distinct groups of data.  I will cover that in a later post, which will be labeled ANOVA.  As a side note, I'm a baseball fan, so I'm going to provide examples from the MLB. The labels below describe the data of a sample, not the population.  The population is the set of all possible people or objects which fall under the category under study.  If we were studying the 2017 ERAs of pitchers, the population would be all MLB pitchers who have pitched in 2017.  The sample is the subset of the population which we are getting the data points from.  If we want to look at the 8 teams who have made it to the Division Series, then the sample is the pitchers

Basic Statistics Lecture #6: Types of Graphs

As mentioned last time, I am going to cover different types of graphs. Pie charts vs. bar charts: pie charts are graphs where the data is represented as slices of a circle.  We've all seen pie charts, and they are exactly what it says on the tin. Bar graphs are another type of data visualization which most people have seen; they represent the data as bars, where the x-axis is the independent variable and the y-axis is the dependent variable. Pie charts have the advantages of displaying categories within one set with respect to one another and working best with percentages.  They have the disadvantages that every category must be displayed for the chart to be read correctly (no truncation) and that all data values must be present.  Bar charts have the advantages of being very flexible (there are no distortions from truncating categories) and of being able to compare 2 or more data sets.  The disadvantage is that they are not very useful for percentages. Stem-
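Here is a minimal sketch of the two chart types discussed above using matplotlib; the category names and percentage values are invented purely for illustration.

```python
# Pie chart vs. bar chart sketch with matplotlib. Categories/values are invented.
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [35, 25, 20, 20]  # percentages of one whole

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 4))

# Pie chart: each category as a slice of one circle (best for percentages).
ax_pie.pie(values, labels=categories, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

# Bar chart: categories on the x-axis, values on the y-axis.
ax_bar.bar(categories, values)
ax_bar.set_title("Bar chart")

plt.tight_layout()
plt.show()
```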

Basic Statistics Lecture #5: Bayes' Theorem

As promised last time, I am going to cover Bayes' Theorem. The tree diagram is the common way to represent Bayes' Theorem.  Recall that conditional probability is given by $P(A \mid B) = \frac{P(A \wedge B)}{P(B)}$.  For tree diagrams, let's say that we have events A, $B_1$, $B_2$, $B_3$, … (the reason we have multiple B's is that they all are within the same family of events) such that the events in the family of B are mutually exclusive and the sum of the probabilities of the events in the family of B is equal to 1. Then we have $$P(B_i \mid A)= \frac{P(B_i)*P(A \mid B_i)}{\sum_{m=1}^{n}[P(B_m)*P(A \mid B_m)]}$$  What this means is best seen from the tree diagram. If we are only looking at the sub-items of A, this is what the tree diagram would look like. If J has a probability of 100%, and P(C) and P(D) are not 0, then when we are trying to find the probability of any of the B's being true given that A is true, we have to set the probability of A to be the entir
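Here is a minimal sketch of the formula above applied to a mutually exclusive, exhaustive family B_1, B_2, B_3; the prior probabilities P(B_i) and the conditional probabilities P(A | B_i) are invented for illustration only.

```python
# Bayes' Theorem sketch over a family B_1, ..., B_n. All probabilities invented.
priors = [0.5, 0.3, 0.2]            # P(B_1), P(B_2), P(B_3); they sum to 1
likelihoods = [0.10, 0.40, 0.70]    # P(A | B_1), P(A | B_2), P(A | B_3)

# Denominator: total probability of A, summed over the whole family of B.
p_a = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior for each B_i: P(B_i | A) = P(B_i) * P(A | B_i) / P(A).
posteriors = [p * l / p_a for p, l in zip(priors, likelihoods)]

for i, post in enumerate(posteriors, start=1):
    print(f"P(B_{i} | A) = {post:.3f}")
print(f"check: posteriors sum to {sum(posteriors):.3f}")
```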