
Basic Statistics Lecture #7: Quantitative, Continuous, and Numerical Data

As promised last time, today I will cover the basic calculations performed on quantitative data collected from a sample.  Please note that all of the following is for the simple case of one group of data.  For two or more distinct groups, the calculations are similar but slightly more involved because the groups must be treated separately.  I will cover that in a later post, which will be labeled ANOVA.  As a side note, I'm a baseball fan, so I'm going to provide examples from the MLB.

The definitions below apply to a sample, not to the population.  The population is the set of all possible people or objects that fall under the category being studied.  If we were studying the 2017 ERAs of pitchers, the population would be all MLB pitchers who pitched in 2017.  The sample is the subset of the population we actually collect data points from.  If we only look at the 8 teams that made the Division Series, then the sample is the pitchers from those 8 teams.

The following will allow us to graph quantitative data and place it in one of three categories: centered, right-skewed, or left-skewed.  Centered means the mean is approximately equal to the median.  Right-skewed means the mean is much greater than the median, and left-skewed means the mean is much less than the median.  Skewed data will often contain outliers.  A centered bell curve is what data from a normal distribution looks like; there it's best to use the mean to describe the center and the variance and standard deviation to describe the spread.  Skewed data may not come from a normal distribution (it may come from an exponential, Poisson, or gamma distribution); there it is best to use the quartiles, min, and max to describe the center and spread.  We should also produce a box plot to identify any outliers.
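Here is a minimal sketch of that mean-versus-median rule of thumb in Python.  The tolerance `tol` for calling the mean and median "approximately equal" is my own arbitrary choice, not part of the rule itself.

```python
# Classify a data set as centered, right-skewed, or left-skewed
# by comparing its mean to its median.
from statistics import mean, median

def classify_shape(data, tol=0.1):
    m, med = mean(data), median(data)
    if abs(m - med) <= tol * max(abs(med), 1):
        return "centered"
    return "right-skewed" if m > med else "left-skewed"

print(classify_shape([1, 2, 3, 4, 5]))      # centered
print(classify_shape([1, 1, 2, 2, 3, 20]))  # right-skewed (mean >> median)
```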

The arithmetic mean (or average) of a sample is given by $\overline{x}=\frac{\sum_{i=1}^{n}x_i}{n}$, where $x_i$ is a given data point and $n$ is the sample size.  The ERA of a given pitcher is a scaled mean: 9 times (because there are 9 innings per game) the pitcher's earned runs (runs given up which had nothing to do with fielding errors) divided by the number of innings pitched.
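As a quick illustration, here is how the sample mean and the ERA formula look in Python; the earned-run and innings numbers below are made up for the example.

```python
# Sample mean: sum of the data points divided by the sample size n.
def sample_mean(xs):
    return sum(xs) / len(xs)

# ERA: 9 (innings per game) times earned runs per inning pitched.
def era(earned_runs, innings_pitched):
    return 9 * earned_runs / innings_pitched

print(sample_mean([2.5, 3.1, 4.0]))               # 3.2
print(era(earned_runs=20, innings_pitched=60.0))  # 3.0
```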


The variance is given by $s^{2}=\frac{\sum_{i=1}^{n} (x_i -\overline{x})^2}{n-1}$.  This is a measure of how much the data vary from the average: the higher the variance, the higher the variability.  If two relief pitchers have the same ERA, I'm more likely to use the one with the lower variance, because there is more of a guarantee that the runs given up will be low, especially if both pitchers have a high number of innings pitched.  The standard deviation is the square root of the variance: $s=\sqrt{s^2}=\sqrt{\frac{\sum_{i=1}^{n} (x_i -\overline{x})^2}{n-1}}$.  The mode, if it exists, is the value which shows up most frequently.  After 2 weeks in the MLB, the mode of the total wins of pitchers would likely be either 1 or 2, because those win totals would occur with the highest frequency.
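Python's `statistics` module computes all three of these directly; its `variance` uses the same n - 1 denominator as the formula above.  The win totals below are hypothetical.

```python
from statistics import variance, stdev, mode

wins = [0, 1, 1, 2, 2, 2, 3]  # hypothetical pitcher win totals

print(variance(wins))  # sum of squared deviations from the mean, over n - 1
print(stdev(wins))     # square root of the variance
print(mode(wins))      # 2, the most frequent value
```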

When the data is ordered from smallest to largest, the median is defined as the middle point.  When the number of entries is odd (n = 2m + 1 for an integer m), the median is simply the middle number, $x_{m+1}$.  When the number of entries is even (n = 2m for an integer m), though, there is no single middle number.  What is required, then, is to take the average of the middle two numbers, $\frac{x_m + x_{m+1}}{2}$.
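That definition translates directly into code.  A small sketch:

```python
# Median: middle entry when n is odd, average of the two middle
# entries when n is even (indices shift because Python is 0-based).
def median(xs):
    s = sorted(xs)
    n = len(s)
    m = n // 2
    if n % 2 == 1:                # n = 2m + 1
        return s[m]
    return (s[m - 1] + s[m]) / 2  # n = 2m

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```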

The first quartile is defined as the data point which separates the ordered data into a 1:3 ratio (the 25th percentile).  This means that a quarter (25%) of the data is less than the first quartile and three-quarters (75%) of the data is greater than the first quartile.  The third quartile is defined as the data point which separates the data into a 3:1 ratio (the 75th percentile).  This means that three-quarters (75%) of the data is less than the third quartile and a quarter (25%) of the data is greater than the third quartile.  The minimum is defined as the smallest data point, and the maximum is defined as the largest data point.  The interquartile range (IQR) is the difference between the third quartile and the first quartile.
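In Python, `statistics.quantiles` computes these.  There are several quartile conventions; the "inclusive" method below matches the median-of-halves approach most introductory texts use.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q3)   # 3.0 7.0  (first and third quartiles)
print(q3 - q1)  # 4.0      (interquartile range)
```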

An outlier is any point which is so far away from the rest of the cluster that we can justify its removal from the rest of the statistical analysis.  The usual rule flags any sample point more than 1.5 times the interquartile range beyond the nearest quartile, i.e., any point outside the interval $[Q_1 - 1.5(Q_3 - Q_1),\ Q_3 + 1.5(Q_3 - Q_1)]$.  These outliers should still be reported for transparency, along with the disclaimer that they were not used in further statistical analysis.  The 1.5 IQR rule flags a point as suspect, but on its own it is not proof that something went wrong in collecting that data point, so the removal should be justified where possible.
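A minimal sketch of those 1.5 IQR fences, with made-up data containing one obvious outlier:

```python
from statistics import quantiles

def outliers(xs):
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the two fences
    return [x for x in xs if x < lo or x > hi]

print(outliers([2, 3, 3, 4, 4, 5, 5, 6, 30]))  # [30]
```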

These are the quantities typically shown in box-and-whisker plots, where the two boxes, taken together and separated by the line at the median, make up the interquartile range, and the two whiskers cover the lower quarter and the upper quarter of the data.  Any outlier is represented by something point-like: asterisks, dots, or small circles.

[Figure: visual representation of quartiles and their outliers in a box-and-whisker plot.]
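If you want to draw one yourself, matplotlib (an assumption on my part, not something this post depends on) produces exactly this picture, with outliers rendered as point-like "fliers":

```python
import matplotlib.pyplot as plt

data = [2, 3, 3, 4, 4, 5, 5, 6, 30]  # same made-up data as above
plt.boxplot(data)  # box = IQR, whiskers = lower/upper quarters
plt.title("Box-and-whisker plot with one outlier")
plt.show()
```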

That covers the basic summary statistics used in statistical analysis.  Next time, I will cover a concept called Anscombe's Quartet, which shows why summary statistics alone are insufficient for statistical analysis.  Until then, stay curious.

K. "Alan" Eister has his bacholers of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
