
Basic Statistics Lecture #7: Quantitative, Continuous, and Numerical Data

As promised last time, today I will cover the basic calculations performed on quantitative data collected from a sample.  Please note that all of the following is for the simple case of one group of data.  For two or more distinct groups, the calculations are similar but slightly more involved because the groups must be treated separately.  I will cover that in a later post, which will be labeled ANOVA.  As a side note, I'm a baseball fan, so I'm going to provide examples from the MLB.

The definitions below apply to a sample, not to the population.  The population is the set of all possible people or objects that fall under the category being studied.  If we were studying the 2017 ERAs of pitchers, the population would be all MLB pitchers who pitched in 2017.  The sample is the subset of the population we actually collect data points from.  If we only look at the 8 teams that made the Division Series, then the sample is the pitchers from those 8 teams.

The following will allow us to graph quantitative data and place it in one of three categories: centered, right-skewed, or left-skewed.  Centered means the mean is approximately equal to the median.  Right-skewed means the mean is much greater than the median, and left-skewed means the mean is much less than the median.  Skewed data will often contain outliers.  A centered bell curve is what data from a normal distribution looks like; there it's best to use the mean to describe the center and the variance and standard deviation to describe the spread.  Skewed data may not come from a normal distribution (it may come from an exponential, Poisson, or gamma distribution); there it is best to use the quartiles, min, and max to describe the center and spread.  We should also produce a box plot to identify any outliers.
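Here is a minimal sketch of that mean-versus-median rule of thumb in Python.  The tolerance `tol` for calling the mean and median "approximately equal" is my own arbitrary choice, not part of the rule itself.

```python
# Classify a data set as centered, right-skewed, or left-skewed
# by comparing its mean to its median.
from statistics import mean, median

def classify_shape(data, tol=0.1):
    m, med = mean(data), median(data)
    if abs(m - med) <= tol * max(abs(med), 1):
        return "centered"
    return "right-skewed" if m > med else "left-skewed"

print(classify_shape([1, 2, 3, 4, 5]))      # centered
print(classify_shape([1, 1, 2, 2, 3, 20]))  # right-skewed (mean >> median)
```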

The arithmetic mean (or average) of a sample is given by $\overline{x}=\frac{\sum_{i=1}^{n}x_i}{n}$, where $x_i$ is a given data point and $n$ is the sample size.  The ERA of a given pitcher is a scaled mean: 9 times (because there are 9 innings per game) the pitcher's earned runs (runs given up which had nothing to do with fielding errors) divided by the number of innings pitched.
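As a quick illustration, here is how the sample mean and the ERA formula look in Python; the earned-run and innings numbers below are made up for the example.

```python
# Sample mean: sum of the data points divided by the sample size n.
def sample_mean(xs):
    return sum(xs) / len(xs)

# ERA: 9 (innings per game) times earned runs per inning pitched.
def era(earned_runs, innings_pitched):
    return 9 * earned_runs / innings_pitched

print(sample_mean([2.5, 3.1, 4.0]))               # 3.2
print(era(earned_runs=20, innings_pitched=60.0))  # 3.0
```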


The variance is given by $s^{2}=\frac{\sum_{i=1}^{n} (x_i -\overline{x})^2}{n-1}$.  This is a measure of how much the data vary from the average: the higher the variance, the higher the variability.  If two relief pitchers have the same ERA, I'm more likely to use the one with the lower variance, because there is more of a guarantee that the runs given up will be low, especially if both pitchers have a high number of innings pitched.  The standard deviation is the square root of the variance: $s=\sqrt{s^2}=\sqrt{\frac{\sum_{i=1}^{n} (x_i -\overline{x})^2}{n-1}}$.  The mode, if it exists, is the value which shows up most frequently.  After 2 weeks in the MLB, the mode of the total wins of pitchers would likely be either 1 or 2, because those win totals would occur with the highest frequency.
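Python's `statistics` module computes all three of these directly; its `variance` uses the same n - 1 denominator as the formula above.  The win totals below are hypothetical.

```python
from statistics import variance, stdev, mode

wins = [0, 1, 1, 2, 2, 2, 3]  # hypothetical pitcher win totals

print(variance(wins))  # sum of squared deviations from the mean, over n - 1
print(stdev(wins))     # square root of the variance
print(mode(wins))      # 2, the most frequent value
```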

When the data is ordered from smallest to largest, the median is defined as the middle point.  When the number of entries is odd (n = 2m + 1 for an integer m), the median is simply the middle number, $x_{m+1}$.  When the number of entries is even (n = 2m for an integer m), though, there is no single middle number.  What is required, then, is to take the average of the middle two numbers, $\frac{x_m + x_{m+1}}{2}$.
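That definition translates directly into code.  A small sketch:

```python
# Median: middle entry when n is odd, average of the two middle
# entries when n is even (indices shift because Python is 0-based).
def median(xs):
    s = sorted(xs)
    n = len(s)
    m = n // 2
    if n % 2 == 1:                # n = 2m + 1
        return s[m]
    return (s[m - 1] + s[m]) / 2  # n = 2m

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```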

The first quartile is defined as the data point which separates the ordered data into a 1:3 ratio (the 25th percentile).  This means that a quarter (25%) of the data is less than the first quartile and three-quarters (75%) of the data is greater than the first quartile.  The third quartile is defined as the data point which separates the data into a 3:1 ratio (the 75th percentile).  This means that three-quarters (75%) of the data is less than the third quartile and a quarter (25%) of the data is greater than the third quartile.  The minimum is defined as the smallest data point, and the maximum is defined as the largest data point.  The interquartile range (IQR) is the difference between the third quartile and the first quartile.
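In Python, `statistics.quantiles` computes these.  There are several quartile conventions; the "inclusive" method below matches the median-of-halves approach most introductory texts use.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q3)   # 3.0 7.0  (first and third quartiles)
print(q3 - q1)  # 4.0      (interquartile range)
```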

An outlier is any point which is so far away from the rest of the cluster that we can justify its removal from the rest of the statistical analysis.  The usual rule flags any sample point more than 1.5 times the interquartile range beyond the nearest quartile, i.e., any point outside the interval $[Q_1 - 1.5(Q_3 - Q_1),\ Q_3 + 1.5(Q_3 - Q_1)]$.  These outliers should still be reported for transparency, along with the disclaimer that they were not used in further statistical analysis.  The 1.5 IQR rule flags a point as suspect, but on its own it is not proof that something went wrong in collecting that data point, so the removal should be justified where possible.
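A minimal sketch of those 1.5 IQR fences, with made-up data containing one obvious outlier:

```python
from statistics import quantiles

def outliers(xs):
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the two fences
    return [x for x in xs if x < lo or x > hi]

print(outliers([2, 3, 3, 4, 4, 5, 5, 6, 30]))  # [30]
```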

These are the quantities typically shown in box-and-whisker plots, where the two boxes, taken together and separated by the line at the median, make up the interquartile range, and the two whiskers cover the lower quarter and the upper quarter of the data.  Any outlier is represented by something point-like: asterisks, dots, or small circles.

[Figure: visual representation of quartiles and their outliers in a box-and-whisker plot.]
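If you want to draw one yourself, matplotlib (an assumption on my part, not something this post depends on) produces exactly this picture, with outliers rendered as point-like "fliers":

```python
import matplotlib.pyplot as plt

data = [2, 3, 3, 4, 4, 5, 5, 6, 30]  # same made-up data as above
plt.boxplot(data)  # box = IQR, whiskers = lower/upper quarters
plt.title("Box-and-whisker plot with one outlier")
plt.show()
```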

That covers the basic summary statistics used in statistical analysis.  Next time, I will cover a concept called Anscombe's Quartet, which shows why summary statistics alone are insufficient for statistical analysis.  Until then, stay curious.

K. "Alan" Eister has his bacholers of Science in Chemistry. He is also a tutor for Varsity Tutors.  If you feel you need any further help on any of the topics covered, you can sign up for tutoring sessions here.
