## What Is a Boxplot?

A boxplot is a graphical summary of how the values in a dataset are distributed. It is based on five main quantities:

1) Minimum Value

2) 25th percentile

3) Median

4) 75th percentile

5) Maximum Value

Together, these five quantities are called the “five-number summary”.

Don’t worry, you do not need to calculate these values by hand as R will do it for you.

Case 1 – Consider the heights (in cm) of the students in your class

| Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Height (cm) | 170 | 185 | 162 | 169 | 180 | 142 | 159 | 153 | 154 | 180 |

## Code for Boxplot in R

To create the boxplot, we will enter this data as a vector in R

#Create a vector of data values in R

height <- c(170, 185, 162, 169, 180, 142, 159, 153, 154, 180)

#Create the boxplot using boxplot()

boxplot(height)

Once you run the commands in R, you will get a figure showing a box with a dark center line and two wings (usually called whiskers). This figure is called a boxplot.

- The wing (whisker) extending above the box ends at the maximum value of the data, and the one extending below ends at the minimum value. Here they end at 185cm and 142cm.
- The dark line inside the box represents the median. It is the middle value of the sorted data, meaning 50% of the data points lie above this line and 50% lie below it. Here it is approximately 165cm, so half of the students are taller than about 165cm and half are shorter.
- The upper edge of the box represents the 75th percentile, or third quartile. About 25% of the data points lie above it, and about 75% lie below it. Here it is approximately 180cm, so roughly 75% of the students are shorter than 180cm.
- The bottom edge of the box represents the 25th percentile, or first quartile. It is the mirror image of the third quartile: about 25% of the data points lie below it, and about 75% lie above it. Here it is approximately 155cm, so roughly 25% of the students are shorter than 155cm.
- Overall, you can say that the boxplot divides the dataset into four parts, with roughly 25% of the data points in each part.
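The values read off the plot above can also be checked numerically. The following is a minimal sketch using base R's `fivenum()` and `quantile()` on the Case 1 data; the variable name `five` is only an illustrative choice. Note that `boxplot()` draws the box edges at the hinges returned by `fivenum()`, which can differ slightly from the default quartiles returned by `quantile()`.

```r
# Heights (cm) of the ten students from Case 1
height <- c(170, 185, 162, 169, 180, 142, 159, 153, 154, 180)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(height)
print(five)
# 142.0 154.0 165.5 180.0 185.0

# quantile() interpolates differently by default, so its quartiles
# can differ slightly from the hinges that boxplot() draws
print(quantile(height, probs = c(0.25, 0.5, 0.75)))
# 25% = 155.25, 50% = 165.50, 75% = 177.50
```

This is why the interpretation above uses "approximately": the exact position of a quartile depends on the convention used to compute it.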

Case 2 – Consider the heights (in cm) of the basketball players in your college (a sport that favors tall people)

| Player | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Height (cm) | 170 | 185 | 175 | 169 | 180 | 142 | 181 | 179 | 182 | 184 |

To create the boxplot, we will enter this data as a vector in R

#Create a vector of data values in R

height <- c(170, 185, 175, 169, 180, 142, 181, 179, 182, 184)

#Create the boxplot using boxplot()

boxplot(height)

Note: Do not worry about how R calculated all these values. Just focus on interpretation.

- Now observe this boxplot! Is it similar to the previous one or different? If you look closely, you will see that it has the same parts as before, plus a dot below the box. The quartiles and the maximum are read the same way as in the previous plot, but the lower wing now ends at the smallest value that is not an outlier. So why do we have this big dot at the bottom? It is called an outlier. Why?
- Imagine that the basketball team in your college is made up of tall players. Now, there is a student who is not very tall but still plays basketball well, so he got selected for the team. He may have excellent skills, but his short height makes him stand out from the rest. This means he is an outlier.
- You can interpret everything else the same way as above, but look at the outlier! Its height is less than 150cm, which is very low compared to the other players. We call it a lower outlier.
- So, to conclude, an outlier is a value that stands apart from the rest of the group. You can imagine this team playing a match, and you will be able to point out this short player easily compared to any other player!
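To see which value R treats as the outlier, the numbers behind the plot can be inspected with base R's `boxplot.stats()`. This is a minimal sketch on the Case 2 data; the names `stats`, `box_height`, and `lower_limit` are illustrative. It assumes R's default rule, where a point is drawn as an outlier if it lies more than 1.5 times the box height beyond the box.

```r
# Heights (cm) of the ten basketball players from Case 2
height <- c(170, 185, 175, 169, 180, 142, 181, 179, 182, 184)

# boxplot.stats() returns the numbers behind the plot without drawing it
stats <- boxplot.stats(height)

print(stats$stats)  # lower whisker end, box bottom, median, box top, upper whisker end
# 169.0 170.0 179.5 182.0 185.0
print(stats$out)    # points drawn individually as outliers
# 142

# Default rule: a point is a lower outlier if it lies more than
# 1.5 times the box height below the bottom of the box
box_height <- stats$stats[4] - stats$stats[2]
lower_limit <- stats$stats[2] - 1.5 * box_height
print(lower_limit)
# 152
```

Since 142cm is below the limit of 152cm, it is plotted as a separate dot, and the lower wing stops at 169cm instead.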

Case 3 – Consider the heights (in cm) of the rock-climbing players in your college (a sport that, in this example, favors short people)

| Player | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Height (cm) | 144 | 143 | 142 | 147 | 148 | 142 | 159 | 147 | 172 | 154 |

To create the boxplot, we will enter this data as a vector in R

#Create a vector of data values in R

height <- c(144, 143, 142, 147, 148, 142, 159, 147, 172, 154)

#Create the boxplot using boxplot()

boxplot(height)

- Now, you can observe this boxplot in the same way and make the judgement yourself. Everything else is read as before, but the outlier in the rock-climbing team lies above all the other players, which means it is an upper outlier.
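The three cases can also be compared in a single figure. The sketch below assumes the data are collected in a named list (the names `groups`, `Class`, `Basketball`, `Climbing`, and `result` are illustrative, not part of the original examples) and passes that list to `boxplot()`, which draws one box per group and returns the underlying numbers.

```r
# Heights (cm) from the three cases, stored as a named list
groups <- list(
  Class      = c(170, 185, 162, 169, 180, 142, 159, 153, 154, 180),
  Basketball = c(170, 185, 175, 169, 180, 142, 181, 179, 182, 184),
  Climbing   = c(144, 143, 142, 147, 148, 142, 159, 147, 172, 154)
)

# Draw one box per group on a shared axis; the result also holds the numbers
result <- boxplot(groups, ylab = "Height (cm)")

# Outlier values and the index of the group each one belongs to
print(result$out)
# 142 172
print(result$group)
# 2 3
```

The output shows no outlier for the class, a lower outlier (142cm) in the basketball team, and an upper outlier (172cm) in the rock-climbing team, matching the three plots discussed above.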