Basic Statistics Lecture #0: Basic Terms and Calculations

This is an entry in the Basic Statistics Lecture Series that I should have included from the start, and I apologize for the omission.  This lecture covers the basic calculations that underlie all of statistical analysis, at a fundamental level.

When we collect statistical data, we typically collect it from only a small part of the population as a whole.  The population is the entire group we want to draw conclusions about.  The small part of the population we actually measure is called a sample of the population, or just a sample for short.  The number of data points we collect -- the number of items in the sample -- is called the sample size, which is typically denoted by the letter n.

In statistics, we typically calculate the arithmetic mean, which is the sum of all values divided by the sample size.  For the population as a whole, this is denoted by the Greek letter μ (mu), and for a sample, it's denoted by $\overline{x}$ (read "x-bar"). The sample mean is expressed by $\overline{x}= \frac{\sum_{i=1}^{n}x_{i}}{n}$, where the capital Greek letter Sigma denotes that all of the indexed items should be added together.  To put it another way, take all of the numbers, add them together, and divide by how many there are.  For example, we can take the average of the wins of the 2017 baseball season.  We just add all of the numbers in the wins column, and then divide by 30 (the number of teams in MLB).  After doing this, the average wins in the season was 81.  This is exactly what we would expect from the 162-game season, since every win comes with a corresponding loss.
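Here is a quick sketch of that calculation in Python, using a small made-up sample of win totals rather than the full 30-team 2017 standings:

```python
def sample_mean(values):
    """Arithmetic mean: sum all values, then divide by the sample size n."""
    return sum(values) / len(values)

# Hypothetical win totals for illustration (not the actual 2017 standings).
wins = [81, 93, 75, 78, 80]
print(sample_mean(wins))  # 81.4
```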

The variance of a sample describes how the data deviates from the arithmetic mean.  To calculate it, first subtract the mean from each individual data point.  Then square each of those differences so they are all positive.  Add all of these squared differences together.  Finally, divide that sum by the degrees of freedom: the number of values you are able to change arbitrarily without changing the mean.  In general, the degrees of freedom is the total sample size less the number of non-arbitrary groups.  The sample variance is denoted $s^{2}=\frac{\sum_{i=1}^{n}(x_i-\overline{x})^{2}}{n-1}$.  In this case, the total sample size is 30 teams, and there is only one non-arbitrary group, so the degrees of freedom is 30-1=29.  After all is said and done, the variance of wins in MLB in 2017 is 132.97.
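The steps above -- subtract the mean, square, sum, divide by n-1 -- can be sketched directly, again with a small made-up sample:

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean,
    divided by the degrees of freedom (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / (n - 1)

# Hypothetical win totals for illustration (not the actual 2017 standings).
wins = [81, 93, 75, 78, 80]
print(sample_variance(wins))  # approximately 47.3
```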

More often than not, though, it is more useful to calculate the standard deviation, which is the square root of the variance.  In this case, that value is 11.53.  This captures the same concept as the variance, but it is in the same units and of the same magnitude as the mean.  The squaring in the variance puts the value on a drastically different scale from the mean, but taking the square root brings it back to a comparable one.  The standard deviation is also more useful in statistical analysis, as we will see in future lectures.
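As a sketch, Python's built-in `statistics` module computes the same n-1 sample variance described above, and the standard deviation is just its square root (using the same hypothetical sample as before):

```python
import math
import statistics

# Hypothetical win totals for illustration (not the actual 2017 standings).
wins = [81, 93, 75, 78, 80]

var = statistics.variance(wins)  # sample variance (n - 1 denominator)
sd = math.sqrt(var)              # standard deviation = sqrt(variance)

# statistics.stdev computes the same value directly.
print(sd, statistics.stdev(wins))
```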

The median of a data set is simply its middle entry.  The first thing we do is order the entries from smallest to largest.  Then, if there is an odd number of entries (n=2m+1 for an integer m), the median is the (m+1)th data point.  If there is an even number of data points (n=2m), then we take the average of the middle two points.  For example, in the 2017 MLB season there were 30 teams, so we have to take the 15th and 16th lowest win totals and average them.  Since the Seattle Mariners and the Texas Rangers were tied at 78 wins, and the Kansas City Royals and the Los Angeles Angels of Anaheim were tied at 80, the two middle values are a single 78 and a single 80.  So the median -- the average of the two middle values -- would be 79.
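Both cases (odd and even n) can be sketched in a few lines, again on a small made-up sample:

```python
def median(values):
    """Middle entry of the sorted data; for even n, the average
    of the two middle entries."""
    ordered = sorted(values)
    n = len(ordered)
    m, odd = divmod(n, 2)
    if odd:                                    # n = 2m + 1: take the (m+1)th point
        return ordered[m]
    return (ordered[m - 1] + ordered[m]) / 2   # n = 2m: average the middle two

# Hypothetical win totals (even count, mirroring the 78-and-80 case above).
print(median([78, 80, 78, 80, 75, 93]))  # 79.0
```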

The mode is the value that occurs most frequently in the data set.  In the case of the 2017 MLB season, there were three teams with 75 wins: the Baltimore Orioles, the Pittsburgh Pirates, and the Oakland Athletics.  Since this is the only instance of three teams sharing the same win-loss record, 75 wins is the mode of the 2017 MLB season.
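A short sketch of finding the mode, returning every value tied for the highest frequency (again with made-up win totals):

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency in the data."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

# Hypothetical win totals: 75 appears three times, more than any other value.
print(modes([75, 75, 75, 80, 78, 80]))  # [75]
```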

So that's it for the basic calculations of statistics.  If you have any questions, please feel free to ask.  Next time, I will cover the basics of hypothesis testing.  Until then, stay curious.

K. "Alan" Eister has his bacholers of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
