Boxplots and distributions - Year 10 Mathematics

What you will learn

calculate the five-number summary for a data set,
construct and interpret boxplots (box-and-whisker plots),
use the interquartile range (IQR) to measure spread,
identify outliers using the $1.5 \times \text{IQR}$ rule,
compare distributions using parallel boxplots,
organise categorical data in two-way tables,
critically analyse statistical claims and identify sources of bias.

Worked example 0 Real-world example: comparing test scores

Two classes sit the same test. Class A scores: $42, 55, 60, 63, 65, 68, 70, 72, 78, 85$ . Class B scores: $35, 50, 52, 58, 60, 62, 65, 80, 82, 95$ .

Class A five-number summary: min $= 42$ , $Q_1 = 60$ , median $= 66.5$ , $Q_3 = 72$ , max $= 85$ .
Class B five-number summary: min $= 35$ , $Q_1 = 52$ , median $= 61$ , $Q_3 = 80$ , max $= 95$ .
Class A has a higher median and smaller IQR ( $72 - 60 = 12$ ) vs Class B ( $80 - 52 = 28$ ).
Class A performed more consistently; Class B had wider variation.

Key idea: the boxplot reveals both the typical score (median) and how spread out the scores are (IQR).

1. The five-number summary

The five-number summary consists of:

Statistic	Meaning
Minimum	Smallest value
$Q_1$ (lower quartile)	Median of the lower half
Median ( $Q_2$ )	Middle value
$Q_3$ (upper quartile)	Median of the upper half
Maximum	Largest value

The interquartile range (IQR) measures the spread of the middle $50\%$ of data:

Interquartile range

$\text{IQR} = Q_3 - Q_1$

Worked example 1 Finding the five-number summary

Data (already ordered): $12, 15, 18, 20, 22, 25, 27, 30, 35, 40, 42, 50$ .

There are $12$ values.

Minimum $= 12$ , Maximum $= 50$ .
Median $= \dfrac{25 + 27}{2} = 26$ (average of the 6th and 7th values).
Lower half: $12, 15, 18, 20, 22, 25$ . $Q_1 = \dfrac{18 + 20}{2} = 19$ .
Upper half: $27, 30, 35, 40, 42, 50$ . $Q_3 = \dfrac{35 + 40}{2} = 37.5$ .
IQR $= 37.5 - 19 = 18.5$ .
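
If you want to check this arithmetic with a computer, the following is a minimal Python sketch of the "median of each half" method used above. The function name `five_number_summary` is our own choice, not a standard library function, and the sketch assumes at least two data values. Calculators and statistics software sometimes use slightly different quartile conventions, so their answers may differ a little from the ones in this lesson.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method.

    Assumes at least two values. When the number of values is odd,
    the middle value is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # values before the middle position
    upper = data[half + n % 2:]   # skip the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])


data = [12, 15, 18, 20, 22, 25, 27, 30, 35, 40, 42, 50]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]

print(summary)  # (12, 19.0, 26.0, 37.5, 50)
print(iqr)      # 18.5
```

The printed values match the working in Worked example 1.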

2. Constructing and interpreting boxplots

A boxplot displays the five-number summary graphically:

Labelled boxplot showing the five-number summary.

The box spans from $Q_1$ to $Q_3$ and contains the middle $50\%$ of data.
The line inside the box marks the median.
The whiskers extend to the minimum and maximum (or to the most extreme non-outlier values).
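
To see how the five numbers map onto the picture, here is a rough Python sketch that draws a one-line text boxplot. The function `ascii_boxplot` and its `width` parameter are made up for this illustration. It draws the whiskers all the way to the minimum and maximum and does not handle outliers.

```python
def ascii_boxplot(minimum, q1, median, q3, maximum, width=50):
    """Return a one-line text boxplot scaled to `width` characters."""
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must be greater than minimum")

    def col(value):
        # Map a data value to a character position from 0 to width - 1.
        return round((value - minimum) / span * (width - 1))

    line = [" "] * width
    for i in range(col(minimum), col(q1)):
        line[i] = "-"                      # lower whisker
    for i in range(col(q1), col(q3) + 1):
        line[i] = "="                      # the box (middle 50% of the data)
    for i in range(col(q3) + 1, col(maximum) + 1):
        line[i] = "-"                      # upper whisker
    line[col(minimum)] = "|"
    line[col(maximum)] = "|"
    line[col(q1)] = "["
    line[col(q3)] = "]"
    line[col(median)] = "M"                # median line inside the box
    return "".join(line)


# Five-number summary from Worked example 1.
print(ascii_boxplot(12, 19, 26, 37.5, 50))
```

In the output, `[` and `]` mark $Q_1$ and $Q_3$ , `M` marks the median, and the `|` characters mark the minimum and maximum.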

3. Identifying outliers

An outlier is a value that is unusually far from the rest of the data. The standard rule:

Outlier boundaries

Lower fence

$\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}$

Upper fence

$\text{Upper fence} = Q_3 + 1.5 \times \text{IQR}$

Any data value below the lower fence or above the upper fence is classified as an outlier.

Worked example 2 Detecting an outlier

A data set has $Q_1 = 19$ , $Q_3 = 37.5$ , IQR $= 18.5$ . The maximum value is $80$ . Is $80$ an outlier?

Upper fence $= 37.5 + 1.5 \times 18.5 = 37.5 + 27.75 = 65.25$ .
Since $80 > 65.25$ , the value $80$ is an outlier.
On a boxplot, the upper whisker would stop at the largest value that is not an outlier (a data value no greater than $65.25$ ), and $80$ would be plotted as an individual dot.
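
The fence calculation can also be sketched in Python. The helper names `fences` and `find_outliers` are our own, and the sample list contains invented values (apart from $80$ ) because the worked example gives only the quartiles and the maximum.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that lie below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


lower, upper = fences(19, 37.5)
print(lower, upper)  # -8.75 65.25

# Invented sample values for illustration; only 80 comes from the worked example.
sample = [12, 40, 65, 80]
print(find_outliers(sample, 19, 37.5))  # [80]

# The upper whisker ends at the largest value that is not an outlier.
whisker_end = max(v for v in sample if v <= upper)
print(whisker_end)  # 65
```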

4. Comparing distributions and two-way tables

Parallel boxplots (drawn on the same scale) allow direct comparison of centre, spread, and shape.

When comparing, comment on:

Centre: which group has a higher/lower median?
Spread: which group has a larger/smaller IQR?
Shape: is either distribution symmetric or skewed?
Outliers: does either group have unusual values?
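
As an illustration of the centre and spread comparisons, this short Python sketch uses the five-number summaries for Class A and Class B from Worked example 0. The function name `describe` and the dictionary layout are assumptions made for this example only.

```python
def describe(name, summary):
    """Turn a five-number summary into the measures used for comparison."""
    minimum, q1, median, q3, maximum = summary
    return {
        "name": name,
        "median": median,             # centre
        "iqr": q3 - q1,               # spread of the middle 50%
        "range": maximum - minimum,   # overall spread
    }


class_a = describe("Class A", (42, 60, 66.5, 72, 85))
class_b = describe("Class B", (35, 52, 61, 80, 95))

higher_centre = max(class_a, class_b, key=lambda d: d["median"])
more_consistent = min(class_a, class_b, key=lambda d: d["iqr"])

print(higher_centre["name"], "has the higher median:", higher_centre["median"])
# Class A has the higher median: 66.5
print(more_consistent["name"], "has the smaller IQR:", more_consistent["iqr"])
# Class A has the smaller IQR: 12
```

Shape and outliers still need to be judged from the plot itself, so the code covers only the first two comparison points.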

A two-way table organises categorical data by two variables. It shows frequencies and can reveal associations.

Worked example 3 Two-way table

A survey of $100$ students asks about pet ownership and gender.

	Owns a pet	No pet	Total
Female	32	18	50
Male	28	22	50
Total	60	40	100

$P(\text{owns a pet}) = \dfrac{60}{100} = 0.6$ .
$P(\text{owns a pet} \mid \text{female}) = \dfrac{32}{50} = 0.64$ .
$P(\text{owns a pet} \mid \text{male}) = \dfrac{28}{50} = 0.56$ .
Females are slightly more likely to own a pet in this sample, but the difference is small.
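
The same probabilities can be read off the table with a few lines of Python. This is only a sketch: the nested-dictionary layout and the function names `marginal` and `conditional` are our own choices.

```python
# Two-way table from Worked example 3, stored row by row.
table = {
    "Female": {"Owns a pet": 32, "No pet": 18},
    "Male": {"Owns a pet": 28, "No pet": 22},
}


def marginal(table, column):
    """P(column): column total divided by the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    column_total = sum(row[column] for row in table.values())
    return column_total / grand_total


def conditional(table, column, row_label):
    """P(column | row): cell frequency divided by that row's total."""
    row = table[row_label]
    return row[column] / sum(row.values())


print(marginal(table, "Owns a pet"))               # 0.6
print(conditional(table, "Owns a pet", "Female"))  # 0.64
print(conditional(table, "Owns a pet", "Male"))    # 0.56
```

Notice that a conditional probability divides by the row total ( $50$ ), not the grand total ( $100$ ).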

5. Analysing statistical claims

When evaluating a statistical claim, consider:

Sample size: is it large enough to be reliable?
Sampling method: is it random and representative, or biased?
Measures used: does the claim use mean, median, or mode? Which is most appropriate?
Visualisation tricks: are axes truncated or scales distorted?
Causation vs correlation: does the claim imply cause when only association is shown?
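
To illustrate why the choice of measure matters, here is a small Python sketch comparing the mean and the median when one value is extreme. The sale prices are invented for this illustration and do not come from real data.

```python
from statistics import mean, median

# Hypothetical house sale prices in thousands of dollars (invented values).
prices = [420, 450, 480, 500, 520, 550, 3000]

print(round(mean(prices)))  # 846
print(median(prices))       # 500

# Remove the single very expensive sale and compare again.
typical_sales = prices[:-1]
print(round(mean(typical_sales)))  # 487
print(median(typical_sales))       # 490.0
```

One extreme sale moves the mean by several hundred thousand dollars, while the median barely changes. A claim based on the mean could therefore give a misleading picture of a "typical" price.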

Worked example 4 Spotting bias

A company claims “9 out of 10 dentists recommend our toothpaste.” What questions should you ask?

How were the dentists selected? (If they were paid by the company, the sample is biased.)
What was the exact question? (“Do you recommend brushing teeth?” is different from “Do you recommend this specific brand?”)
How large was the sample? ( $10$ dentists is too small to generalise.)
Were dentists who disagreed excluded from the report?

Practice

Fluency

Tier 1: basic skills

Find the five-number summary for: $5, 8, 12, 15, 18, 20, 22, 25, 30$ .
Calculate the IQR for the data in Q1.
A data set has $Q_1 = 10$ , $Q_3 = 30$ . Find the upper and lower fences for outlier detection.
The five-number summary for a data set is: $2, 8, 14, 20, 28$ . Sketch a boxplot.
A boxplot has its median closer to $Q_1$ than to $Q_3$ . Is the distribution positively or negatively skewed?
In a two-way table, $40$ out of $100$ people surveyed are left-handed. What proportion is left-handed?
A data set has $Q_1 = 25$ and IQR $= 12$ . What is $Q_3$ ?
True or false: the median always lies exactly in the centre of the box in a boxplot.

Reasoning

Tier 2: mixed practice

The heights (cm) of $15$ students are: $152, 155, 158, 160, 162, 164, 165, 167, 170, 172, 175, 178, 180, 195, 198$ . Find the five-number summary, identify any outliers, and sketch a boxplot.
Two classes have the following five-number summaries for a maths test (out of $50$ ):
- Class X: $15, 28, 35, 40, 48$ .
- Class Y: $20, 25, 30, 42, 50$ . Draw parallel boxplots and write two comparison statements.
A two-way table shows transport mode and year level:

	Bus	Car	Walk	Total
Year 9	30	20	10	60
Year 10	15	35	10	60
Total	45	55	20	120

Find $P(\text{bus} \mid \text{Year 9})$ and $P(\text{bus} \mid \text{Year 10})$ . What do you notice?
A newspaper reports “Average house prices rose by $20\%$ .” Explain why the median might be a better measure than the mean for house prices, and how a few expensive sales could distort the mean.
A data set has values: $3, 5, 7, 8, 10, 12, 14, 15, 50$ . Show that $50$ is an outlier using the $1.5 \times \text{IQR}$ rule.

Reasoning

Tier 3: explain and apply

A study claims students who eat breakfast score higher on tests. The data shows a correlation. Explain why this does not prove causation and suggest a confounding variable.
Two factories produce bolts. Factory A: median length $50.2$ mm, IQR $= 0.8$ mm. Factory B: median length $50.0$ mm, IQR $= 2.5$ mm. Which factory produces more consistent bolts? Which is closer to the target of $50.0$ mm? Discuss trade-offs.
A survey of $200$ people finds that $60\%$ support a new policy. The survey was conducted online and only advertised on one social media platform. Identify two sources of potential bias and explain how each could affect the results.
Explain the difference between the range and the IQR as measures of spread. Give an example where the range is misleading but the IQR is not.

Challenge

Reasoning

Harder reasoning

A data set of $20$ values has $Q_1 = 15$ , median $= 22$ , $Q_3 = 30$ . If the value $60$ is added to the data set, explain qualitatively how each part of the five-number summary might change and whether $60$ would be classified as an outlier.
Two data sets both have median $= 50$ and IQR $= 10$ , but one is symmetric and the other is positively skewed. Sketch boxplots for both and explain how the whisker lengths differ.
A researcher collects data from $500$ people and presents a boxplot showing no outliers. A critic argues that with $500$ data points, some outliers are expected. Evaluate this argument.
Design a two-way table for $80$ students that shows an association between “plays sport” and “gets more than $8$ hours of sleep.” Then modify it so there is no association. Explain the difference.

Answers

Answer key

Attempt the practice first, then check your work against the answers below.


Tier 1

Min $= 5$ , $Q_1 = 10$ (median of $5, 8, 12, 15$ : average of $8$ and $12$ ), median $= 18$ (5th value), $Q_3 = 23.5$ (median of $20, 22, 25, 30$ : average of $22$ and $25$ ), max $= 30$ .
IQR $= Q_3 - Q_1 = 23.5 - 10 = 13.5$ .
IQR $= 30 - 10 = 20$ . Lower fence $= 10 - 1.5 \times 20 = 10 - 30 = -20$ . Upper fence $= 30 + 1.5 \times 20 = 30 + 30 = 60$ .
Boxplot with whisker at $2$ , box from $8$ to $20$ , median line at $14$ , whisker to $28$ .
Positively skewed (the data is more spread out above the median than below).
$\dfrac{40}{100} = 0.4$ or $40\%$ .
$Q_3 = Q_1 + \text{IQR} = 25 + 12 = 37$ .
False. The median is only centred if the distribution is symmetric. In a skewed distribution, the median is closer to one quartile.

Tier 2

Five-number summary: min $= 152$ , $Q_1 = 160$ (median of the values in positions 1 to 7), median $= 167$ (8th value), $Q_3 = 178$ (median of the values in positions 9 to 15), max $= 198$ . IQR $= 178 - 160 = 18$ . Upper fence $= 178 + 1.5 \times 18 = 178 + 27 = 205$ . Lower fence $= 160 - 27 = 133$ . Both $195$ and $198$ are below $205$ , so there are no outliers.
Class X has a higher median ( $35$ vs $30$ ) and a smaller IQR ( $40 - 28 = 12$ vs $42 - 25 = 17$ ). Class X performed better overall and more consistently. Class Y has both a higher maximum ( $50$ vs $48$ ) and a higher minimum ( $20$ vs $15$ ), so the two classes have similar ranges ( $30$ for Class Y vs $33$ for Class X).
$P(\text{bus} \mid \text{Year 9}) = \dfrac{30}{60} = 0.5$ . $P(\text{bus} \mid \text{Year 10}) = \dfrac{15}{60} = 0.25$ . Year 9 students are twice as likely to catch the bus as Year 10 students.
House prices are often positively skewed: most houses cluster around a typical value, but a few very expensive properties pull the mean upward. The median is resistant to extreme values and better represents the “typical” house price. A few multi-million-dollar sales can raise the mean significantly without affecting most buyers’ experience.
Ordered: $3, 5, 7, 8, 10, 12, 14, 15, 50$ . $Q_1 = \dfrac{5 + 7}{2} = 6$ . $Q_3 = \dfrac{14 + 15}{2} = 14.5$ . IQR $= 14.5 - 6 = 8.5$ . Upper fence $= 14.5 + 1.5 \times 8.5 = 14.5 + 12.75 = 27.25$ . Since $50 > 27.25$ , the value $50$ is an outlier.

Tier 3

Correlation does not prove causation because a third variable could explain both. For example, students from families with higher socioeconomic status may be more likely to eat breakfast and have access to tutoring, quiet study spaces, and parental support. The breakfast itself may not cause higher scores; the underlying variable (family resources) may drive both outcomes.
Factory A is more consistent (IQR $= 0.8$ mm vs $2.5$ mm). Factory B has a median closer to the target of $50.0$ mm. Trade-off: Factory A produces bolts of very uniform length but slightly above target; Factory B hits the target on average but with much greater variability. If precision matters (e.g. safety-critical components), Factory A is preferable despite the slight offset, which could be corrected by recalibrating.
Sources of bias: (i) Self-selection bias — only people who chose to respond are counted; those with strong opinions may be overrepresented. (ii) Platform bias — users of that particular social media platform may not be representative of the general population (e.g. younger demographic, specific political leanings). Both could overestimate or underestimate true support depending on the platform’s user base.
Range uses only the two most extreme values, so a single outlier can make the range very large. IQR uses the middle $50\%$ and is resistant to outliers. Example: $\{10, 12, 14, 15, 16, 18, 100\}$ . Range $= 100 - 10 = 90$ (misleadingly large). IQR $= 18 - 12 = 6$ (reflects the actual spread of most data).

Challenge

Adding $60$ : the minimum is unchanged, and the maximum becomes $60$ if $60$ is larger than the previous maximum. Using the original quartiles, IQR $= 30 - 15 = 15$ and the upper fence $= 30 + 1.5 \times 15 = 52.5$ . Since $60 > 52.5$ , $60$ would be classified as an outlier. The median may shift slightly upward (from the average of the 10th and 11th values to the 11th value of the new 21-value set). With the median-of-halves method, $Q_1$ is unchanged because the lower half is still the first $10$ values, while $Q_3$ may shift slightly upward, but the effect is small.
Symmetric: both whiskers are approximately equal length, extending evenly from the box. Positively skewed: the right whisker is much longer than the left; data extends further above $Q_3$ than below $Q_1$ . Both have the same box size (IQR $= 10$ ) and median ( $50$ ), but the skewed version has the median closer to $Q_1$ .
The argument has some merit: in a normal distribution, about $0.7\%$ of values lie beyond $Q_1 - 1.5 \times \text{IQR}$ or $Q_3 + 1.5 \times \text{IQR}$ , so we might expect roughly $0.007 \times 500 = 3.5$ , that is, about $3$ or $4$ outliers. However, if the data is truly free of measurement errors and follows a tight distribution, it is possible (though unlikely) to have no outliers. The researcher should report the distribution shape and explain why outliers are absent.
With association: Sport-yes/Sleep-yes $= 30$ , Sport-yes/Sleep-no $= 10$ , Sport-no/Sleep-yes $= 15$ , Sport-no/Sleep-no $= 25$ . $P(\text{sleep} \mid \text{sport}) = \dfrac{30}{40} = 0.75$ , $P(\text{sleep} \mid \text{no sport}) = \dfrac{15}{40} = 0.375$ . These differ, showing an association. No association: Sport-yes/Sleep-yes $= 22.5$ , Sport-yes/Sleep-no $= 17.5$ , Sport-no/Sleep-yes $= 22.5$ , Sport-no/Sleep-no $= 17.5$ (using whole numbers: $23, 17, 22, 18$ ). Now $P(\text{sleep} \mid \text{sport}) \approx P(\text{sleep} \mid \text{no sport}) \approx 0.5625$ , so the variables are approximately independent.
