Outliers are values in a dataset that lie far away from most other points. A boxplot is commonly used to visualize a dataset and spot such unusual data points. Whether an outlier is a genuine anomaly or a legitimate value is a question the data analyst has to decide.
The boxplot displays five descriptive values: the minimum, \(Q_1\), the median, \(Q_3\) and the maximum.
The First Quartile and Third Quartile
Sort the sample values in ascending order and split the sorted set into two halves. The first quartile, denoted by \(Q_1\), is the median of the lower half. This means that about 25% of the values are less than \(Q_1\).
The third quartile, denoted by \(Q_3\), is the median of the upper half of the set. This means that about 75% of the values are less than \(Q_3\).
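The median-of-halves definition can be sketched in a few lines of R. The sample vector x below is made up for illustration, and the convention of leaving the middle value out of both halves when the sample size is odd is an assumption; other conventions exist.

```r
# Hypothetical sample, not taken from the sales data
x <- sort(c(7, 1, 5, 3, 9, 11, 13, 15))
n <- length(x)
half <- n %/% 2

# Lower and upper halves; for odd n the middle value is left out of both
lower_half <- x[1:half]
upper_half <- x[(n - half + 1):n]

q1 <- median(lower_half)  # median of 1, 3, 5, 7
q3 <- median(upper_half)  # median of 9, 11, 13, 15
print(c(Q1 = q1, Q3 = q3))
```

R's built-in quantile function interpolates between data points by default, so its quartiles can differ slightly from the values computed this way.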
\(IQR\)
The interquartile range (IQR) is the difference between \(Q_3\) and \(Q_1\), that is, \(IQR = Q_3 - Q_1\).
Outliers
IQR is often used to filter out outliers. If an observation falls outside of the following interval,
$$ [~Q_1 - 1.5 \times IQR, ~ ~ Q_3 + 1.5 \times IQR~] $$
it is considered an outlier.
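As a minimal sketch of this rule, the snippet below computes the interval for a small made-up vector and picks out the values that fall outside it. The vector x is hypothetical, and the quartiles come from R's quantile function with its default interpolation, which may not match the median-of-halves definition exactly.

```r
# Hypothetical sample with one obviously extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Quartiles from quantile() with its default settings
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Interval outside of which an observation counts as an outlier
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
print(outliers)
```

For this sample the quartiles are 3 and 7, so the interval is [-3, 13] and only the value 100 is flagged.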
Boxplot Example
It is easy to create a boxplot in R with either the base function boxplot or the ggplot function from the ggplot2 package.
A dataset of 10,000 rows is used here as an example. Three variables, num_of_orders, sales_total and gender, are of interest to analysts who want to compare buying behavior between women and men.
Firstly, load the data into R.
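# stringsAsFactors = TRUE keeps gender as a factor (the default became FALSE in R 4.0.0)
sales <- read.csv("data/yearly_sales.csv", stringsAsFactors = TRUE)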
Select the variable sales_total and inspect it by calling the function summary:
summary(sales$sales_total)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.02 80.29 151.60 249.50 295.50 7606.00
The summary function returns all five descriptive values for the variable sales_total, together with the mean. Run summary on gender too.
summary(sales$gender)
F 5035
M 4965
As gender is a factor with two levels, F and M, the summary function returns the count of each level.
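As a rough check, the IQR rule can be applied by hand to the quartiles printed by summary(sales$sales_total) above. The numbers below are copied from that output; the variable names are only illustrative, and because summary rounds its output the results are approximate.

```r
# Values copied from the summary() output for sales_total
q1 <- 80.29
q3 <- 295.50
observed_min <- 30.02
observed_max <- 7606.00

iqr <- q3 - q1            # 215.21
lower <- q1 - 1.5 * iqr   # -242.525
upper <- q3 + 1.5 * iqr   # 618.315

# Does either extreme fall outside the interval?
print(c(min_is_outlier = observed_min < lower,
        max_is_outlier = observed_max > upper))
```

The maximum lies far above the upper bound while the minimum is well inside the interval, so any outliers in sales_total sit on the high side.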
Boxplot by using boxplot
The following snippet creates three boxplots of sales_total with the base R function boxplot. Each boxplot has its own aesthetic settings.
# set up a layout for plotting: one row, three columns
mat <- matrix(c(1, 2, 3), nrow = 1, ncol = 3)
layout(mat)
# 1. boxplot for all customers
boxplot(sales$sales_total, pch = 19, xlab = 'F and M')
# 2. boxplot for all customers, log scale
boxplot(sales$sales_total, pch = 19, log = 'y', xlab = 'F and M', ylab = 'The Log of sales_total')
# 3. one boxplot for each gender level group, log scale
boxplot(sales$sales_total ~ sales$gender, pch = 19, log = 'y', col = 'bisque', xlab = 'Gender', ylab = 'The Log of sales_total')
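To list the points that a base R boxplot would draw as outliers, rather than only looking at them, one option is the boxplot.stats function, which boxplot relies on for its statistics. The sketch below uses a small hypothetical vector instead of the sales data. Note that boxplot.stats takes its quartiles from Tukey's hinges (see fivenum), which can differ slightly from the quartiles reported by summary.

```r
# Hypothetical sample with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

bs <- boxplot.stats(x, coef = 1.5)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Values beyond the whiskers, i.e. the points drawn individually
print(bs$out)
```

On the real data, the same call on sales$sales_total would return the flagged values in the out component.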
Boxplot by using ggplot
install.packages("colorspace")
# BOXPLOT BY GENDER GROUP
library(ggplot2)
library(Rmisc)

p1 <- ggplot(data = sales, aes(x = gender, y = sales_total)) +
  scale_y_log10() +
  geom_point(aes(color = gender), alpha = 0.2) +
  geom_boxplot(outlier.size = 4, outlier.colour = 'blue', alpha = 0.1)

plot(p1)
Jittering
The points in both boxplots overlap heavily, a problem known as overplotting. A common remedy is to add a little random noise to the points, which is referred to as jittering the data. In the geom_point layer of ggplot, set the position parameter to 'jitter', as shown in the following snippet.
p2 <- ggplot(data = sales, aes(x = gender, y = sales_total)) +
  scale_y_log10() +
  geom_point(aes(color = gender), alpha = 0.2, position = 'jitter') +
  geom_boxplot(outlier.size = 5, alpha = 0.1)
plot(p2)
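With position = 'jitter', ggplot2 applies its default amount of noise in both the horizontal and vertical direction. If the vertical position should stay exact, one possible variation is to pass position_jitter with height = 0. The sketch below assumes a synthetic data frame named demo as a stand-in for the sales data; the jitter width of 0.2, the seed values and the choice to hide the boxplot's own outlier markers (so that extreme points are not drawn twice) are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical stand-in for the sales data: 200 synthetic rows
set.seed(42)
demo <- data.frame(
  gender = factor(rep(c("F", "M"), each = 100)),
  sales_total = rlnorm(200, meanlog = 5, sdlog = 1)
)

p3 <- ggplot(data = demo, aes(x = gender, y = sales_total)) +
  scale_y_log10() +
  # jitter horizontally only, with a fixed seed for reproducible noise
  geom_point(aes(color = gender), alpha = 0.2,
             position = position_jitter(width = 0.2, height = 0, seed = 1)) +
  # outlier.shape = NA hides the boxplot's outlier markers
  geom_boxplot(outlier.shape = NA, alpha = 0.1)

plot(p3)
```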