If you identify points that fall outside this range, these may be worth additional investigation. Sometimes outliers are errors or anomalies that we want to exclude from our analysis. At other times they can reveal insights into special cases in our data that we might not otherwise notice. Outliers are extreme values that differ from most values in the dataset.
Alternative models
Once you’ve identified outliers, you’ll decide what to do with them. Your main options are retaining or removing them from your dataset. This is similar to the choice you’re faced with when dealing with missing data. There are no lower outliers, since there isn’t a number less than -8.5 in the dataset.
What is an Outlier in Statistics? A Definition
It’s best to remove outliers only when you have a sound, legitimate reason for doing so. It’s important to document each outlier you remove and your reasons so that other researchers can follow your procedures.
If you have a small dataset, you may also want to retain as much data as possible to make sure you have enough statistical power. If your dataset ends up containing many outliers, you may need to use a statistical test that’s more robust to them. Go back to your sorted dataset from Step 1 and highlight any values that are greater than the upper fence or less than your lower fence. In practice, it can be difficult to tell different types of outliers apart.
Using the interquartile range
Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers because they may not be unusually far from other observations. You may use Excel to graph the two least-squares regression lines and compare the slopes and fit of the lines to the data, as shown in Figure 12.17. Being able to identify outliers can help to determine what is typical within the data and what are exceptions. If we don’t have outliers, this can increase our confidence in the consistency of our findings. When outliers exist in our data, they can affect the typical measures that we use to describe the data.
There isn’t just one stand-out median (Q2), nor is there a standout lower quartile (Q1) or standout upper quartile (Q3). To see if there is a low outlier, calculate the lower fence (Q1 minus 1.5 times the interquartile range) and check whether any number in the set falls below it. If a data point (or points) is excluded from the data analysis, this should be clearly stated in any subsequent report. We divide by (n – 2) because the regression model involves two estimates (the slope and the intercept).
They can have a big impact on your statistical analyses and skew the results of any hypothesis tests. Outliers are an integral part of data analysis and should not be overlooked. Identifying and understanding the nature of outliers is crucial for accurate data analysis and interpretation. Whether they are removed or adjusted, outliers must be carefully considered to ensure the integrity and reliability of statistical conclusions. Sometimes they should be excluded from the analysis, for example when an outlier is likely the result of incorrect data. Other times, an outlier may hold valuable information about the population under study and should remain included in the data.
Two potential sources are missing data and errors in data entry or recording. For example, when measuring blood pressure, your doctor likely has a good idea of what is considered to be within the normal blood pressure range. If they were looking at the values above, they would identify that all of the values that are highlighted orange indicate high blood pressure. You have a couple of extreme values in your dataset, so you’ll use the IQR method to check whether they are outliers.
Some outliers represent natural variations in the population, and they should be left as is in your dataset. Just like with missing values, the most conservative option is to keep outliers in your dataset. Keeping outliers is usually the better option when you’re not sure if they are errors. This method is helpful if you have a few values on the extreme ends of your dataset, but you aren’t sure whether any of them might count as outliers. You can convert extreme data points into z scores that tell you how many standard deviations away they are from the mean.
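The z-score conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the data values, the function name `z_scores`, and the cutoff of 3 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return how many sample standard deviations each value lies from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]


# Hypothetical data with one extreme value at the upper end.
data = [25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 120]
scores = z_scores(data)

# Assumed cutoff: flag values more than 3 standard deviations from the mean.
CUTOFF = 3
flagged = [v for v, z in zip(data, scores) if abs(z) > CUTOFF]
print(flagged)  # → [120]
```

Note that in very small samples a single extreme value inflates the standard deviation so much that its own z score may never exceed a cutoff of 3, which is one reason the interquartile range method is often preferred for small datasets.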
Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known. Outliers can occur by chance in any distribution, but they can indicate novel behaviour or structures in the data-set, measurement error, or that the population has a heavy-tailed distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate ‘correct trial’ versus ‘measurement error’; this is modeled by a mixture model.
You can use software to visualise your data with a box plot, or a box-and-whisker plot, so you can see the data distribution at a glance. This type of chart highlights minimum and maximum values (the range), the median, and the interquartile range for your data. The average is much lower when you include the outlier compared to when you exclude it. Your standard deviation also increases when you include the outlier, so your statistical power is lower as well. True outliers are also present in variables with skewed distributions where many data points are spread far from the mean in one direction. It’s important to select appropriate statistical tests or measures when you have a skewed distribution or many outliers.
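To illustrate how a single low outlier pulls the mean down and inflates the standard deviation, here is a small Python sketch. The values are hypothetical and chosen only to make the effect visible; the median is included as an example of a measure that is less sensitive to the extreme value.

```python
from statistics import mean, median, stdev

# Hypothetical measurements; the last value (10) plays the role of a low outlier.
with_outlier = [64, 61, 66, 63, 68, 59, 10]
without_outlier = with_outlier[:-1]

summary = {
    "with outlier": (mean(with_outlier), median(with_outlier), stdev(with_outlier)),
    "without outlier": (mean(without_outlier), median(without_outlier), stdev(without_outlier)),
}

for label, (m, med, sd) in summary.items():
    print(f"{label}: mean={m:.2f}, median={med:.2f}, sd={sd:.2f}")
```

In this example the mean and standard deviation shift noticeably when the outlier is included, while the median moves very little.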
The standard deviation used is the standard deviation of the residuals or errors. If the sample size is only 100, however, just three such outliers are already reason for concern, being more than 11 times the expected number. Naive interpretation of statistics derived from data sets that include outliers may be misleading. As illustrated in this case, outliers may indicate data points that belong to a different population than the rest of the sample set. Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction.
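The "more than 11 times" figure above can be checked with a short calculation. The passage does not state the threshold it has in mind, so this sketch assumes that "such outliers" means observations more than three standard deviations from the mean of a normal distribution, a reading that is consistent with the quoted figure. The function name `two_sided_tail` is hypothetical.

```python
from math import erfc, sqrt


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return erfc(k / sqrt(2))


n = 100                      # sample size from the text
p = two_sided_tail(3)        # assumed threshold: 3 standard deviations
expected = n * p             # expected number of such points in the sample
ratio = 3 / expected         # observed count of 3 relative to the expectation

print(f"P(|Z| > 3) = {p:.4f}")
print(f"expected in {n} observations = {expected:.2f}")
print(f"3 observed is {ratio:.1f} times the expected number")
```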
In this article you learned how to find the interquartile range of a dataset and use it to identify outliers. A data point needs to fall more than 1.5 times the interquartile range above the third quartile to be considered a high outlier. Likewise, a data point needs to fall more than 1.5 times the interquartile range below the first quartile to be considered a low outlier. Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
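The 1.5 times interquartile range rule can be sketched in Python as follows. The data and the function name `iqr_fences` are illustrative assumptions. Quartiles are computed with `statistics.quantiles` using its default method; other quartile conventions exist and can give slightly different fences.

```python
from statistics import quantiles


def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    q1, _q2, q3 = quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical dataset with one suspiciously large value.
data = [26, 28, 30, 31, 33, 35, 37, 38, 40, 90]

lower, upper = iqr_fences(data)
outliers = [v for v in data if v < lower or v > upper]

print(lower, upper)  # → 16.0 52.0
print(outliers)      # → [90]
```

Here Q1 is 29.5 and Q3 is 38.5, so the interquartile range is 9 and the fences sit 13.5 below Q1 and 13.5 above Q3. No value falls below the lower fence, so there are no low outliers in this example.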
A physical apparatus for taking measurements may have suffered a transient malfunction. There may have been an error in data transmission or transcription. Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined. Alternatively, an outlier could be the result of a flaw in the assumed theory, calling for further investigation by the researcher. In the third exam/final exam example, you can determine whether there is an outlier.
With regard to the TI-83, 83+, or 84+ calculators, the graphical approach is easier. The graphical procedure is shown first, followed by the numerical calculations. You also want to examine how the correlation coefficient, r, has changed.
You can see that the second graph shows less deviation from the line of best fit. It is clear that omission of the influential point produced a line of best fit that more closely models the data. If we do identify outliers, it’s important to try to determine why they occurred.
- For the example, if any of the |y – ŷ| values are at least 32.94, the corresponding (x, y) data point is a potential outlier.
- When using statistical indicators we typically define outliers in reference to the data we are using.
- These extreme values can impact your statistical power as well, making it hard to detect a true effect if there is one.
We can do this visually in the scatter plot by drawing an extra pair of lines that are two standard deviations above and below the best-fit line. Any data points outside this extra pair of lines are flagged as potential outliers. Or, we can do this numerically by calculating each residual and comparing it with twice the standard deviation.
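The numerical version of this check can be sketched in Python. This is a minimal illustration under stated assumptions: the data points and the function names `fit_line` and `potential_outliers` are hypothetical, and the standard deviation of the residuals is computed with (n – 2) in the denominator, as described earlier in the text.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def potential_outliers(xs, ys):
    """Flag points whose residual is at least twice the residual standard deviation."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)
    s = sqrt(sse / (len(xs) - 2))  # two estimated parameters: slope and intercept
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) >= 2 * s]


# Hypothetical data: points on y = 2x, except (5, 22), which sits well above the line.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 22, 12, 14, 16, 18, 20]

flagged = potential_outliers(xs, ys)
print(flagged)  # → [(5, 22)]
```

A flagged point is only a potential outlier. Whether to keep or remove it still depends on why it differs from the rest of the data.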
In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid. However, in large samples, a small number of outliers is to be expected (and not due to any anomalous condition). In the example, notice the pattern of the points compared with the line. Although the correlation coefficient is significant, the pattern in the scatter plot indicates that a curve would be a more appropriate model to use than a line.