What you will learn
- calculate the five-number summary for a data set,
- construct and interpret boxplots (box-and-whisker plots),
- use the interquartile range (IQR) to measure spread,
- identify outliers using the $1.5 \times \text{IQR}$ rule,
- compare distributions using parallel boxplots,
- organise categorical data in two-way tables,
- critically analyse statistical claims and identify sources of bias.
Two classes sit the same test. Class A scores: . Class B scores: .
- Class A five-number summary: min , , median , , max .
- Class B five-number summary: min , , median , , max .
- Class A has a higher median and smaller IQR () vs Class B ().
- Class A performed more consistently; Class B had wider variation.
Key idea: the boxplot reveals both the typical score (median) and how spread out the scores are (IQR).
1. The five-number summary
The five-number summary consists of:
| Statistic | Meaning |
|---|---|
| Minimum | Smallest value |
| $Q_1$ (lower quartile) | Median of the lower half |
| Median ($Q_2$) | Middle value |
| $Q_3$ (upper quartile) | Median of the upper half |
| Maximum | Largest value |
The interquartile range (IQR) measures the spread of the middle $50\%$ of the data:

$$\text{IQR} = Q_3 - Q_1$$
Data (already ordered): .
There are $n = 12$ values.
- Minimum , Maximum .
- Median (average of the 6th and 7th values).
- Lower half: . .
- Upper half: . .
- IQR .
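The median-of-halves method used in this example can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `five_number_summary` and the data in `scores` are made up for the sketch, and the function assumes at least two values. When the number of values is odd, the middle value is left out of both halves.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # values below the median position
    upper = data[half + (n % 2):]    # skip the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data (12 values), chosen only to illustrate the method.
scores = [3, 5, 7, 8, 10, 12, 14, 15, 17, 19, 21, 24]
minimum, q1, med, q3, maximum = five_number_summary(scores)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)
```

Note that statistical software often uses a different quartile convention (for example, interpolation), so its $Q_1$ and $Q_3$ may differ slightly from the values found by hand.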
2. Constructing and interpreting boxplots
A boxplot displays the five-number summary graphically:
- The box spans from $Q_1$ to $Q_3$ and contains the middle $50\%$ of the data.
- The line inside the box marks the median.
- The whiskers extend to the minimum and maximum (or to the most extreme non-outlier values).
3. Identifying outliers
An outlier is a value that is unusually far from the rest of the data. The standard rule:
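Outlier boundaries

$$\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}, \qquad \text{Upper fence} = Q_3 + 1.5 \times \text{IQR}$$

Any data value below the lower fence or above the upper fence is classified as an outlier.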
A data set has , , IQR . The maximum value is . Is an outlier?
- Upper fence .
- Since , the value is an outlier.
- On a boxplot, the upper whisker would stop at (or the largest non-outlier value), and would be plotted as an individual dot.
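The fence calculation and the whisker rule can be sketched as follows. The helper names (`outlier_fences`, `split_outliers`) and the data set are hypothetical, chosen for illustration. The quartiles are passed in directly, as in the worked example above.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_outliers(values, q1, q3):
    """Separate values inside the fences from values outside them."""
    lower, upper = outlier_fences(q1, q3)
    inside = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inside, outliers

# Hypothetical data with Q1 = 15 and Q3 = 21 (median-of-halves method).
data = [12, 14, 15, 16, 18, 19, 20, 21, 23, 45]
q1, q3 = 15, 21

lower_fence, upper_fence = outlier_fences(q1, q3)
inside, outliers = split_outliers(data, q1, q3)

# Whiskers stop at the most extreme values that are not outliers.
whisker_low, whisker_high = min(inside), max(inside)
print(lower_fence, upper_fence, outliers, whisker_low, whisker_high)
```

In this made-up data set the value 45 lies above the upper fence, so it would be drawn as an individual dot and the upper whisker would stop at 23.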
4. Comparing distributions and two-way tables
Parallel boxplots (drawn on the same scale) allow direct comparison of centre, spread, and shape.
When comparing, comment on:
- Centre: which group has a higher/lower median?
- Spread: which group has a larger/smaller IQR?
- Shape: is either distribution symmetric or skewed?
- Outliers: does either group have unusual values?
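The centre and spread comparisons in this checklist can be sketched in Python. Both data sets and the function names are hypothetical, and the quartiles use the median-of-halves method. Shape and outliers are still best judged from the boxplots themselves.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    # When n is odd, the middle value belongs to neither half.
    return median(data[:half]), median(data), median(data[half + n % 2:])

def compare_groups(name_a, values_a, name_b, values_b):
    """Compare centre (median) and spread (IQR) of two groups.

    Ties are reported in favour of the second group; a fuller version
    would report them separately.
    """
    q1_a, med_a, q3_a = quartiles(values_a)
    q1_b, med_b, q3_b = quartiles(values_b)
    iqr_a, iqr_b = q3_a - q1_a, q3_b - q1_b
    return {
        "higher_median": name_a if med_a > med_b else name_b,
        "smaller_iqr": name_a if iqr_a < iqr_b else name_b,
        "medians": (med_a, med_b),
        "iqrs": (iqr_a, iqr_b),
    }

# Hypothetical test scores for two groups.
group_a = [62, 65, 68, 70, 72, 74, 75, 78]
group_b = [40, 52, 58, 66, 70, 77, 85, 93]

result = compare_groups("Group A", group_a, "Group B", group_b)
print(result)
```

For this made-up data, Group A has both the higher median and the smaller IQR, so it scored higher on a typical test and more consistently.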
A two-way table organises categorical data by two variables. It shows frequencies and can reveal associations.
A survey of $100$ students asks about pet ownership and gender.
| | Owns a pet | No pet | Total |
|---|---|---|---|
| Female | 32 | 18 | 50 |
| Male | 28 | 22 | 50 |
| Total | 60 | 40 | 100 |
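- $P(\text{owns a pet}) = \frac{60}{100} = 0.60$.
- $P(\text{owns a pet} \mid \text{female}) = \frac{32}{50} = 0.64$.
- $P(\text{owns a pet} \mid \text{male}) = \frac{28}{50} = 0.56$.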
- Females are slightly more likely to own a pet in this sample, but the difference is small.
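The proportions behind this comparison can be checked with a short Python sketch. The counts come from the table above; the dictionary layout and the function names are assumptions made for illustration.

```python
# Counts from the two-way table (row and column totals are derived, not stored).
table = {
    "Female": {"Owns a pet": 32, "No pet": 18},
    "Male": {"Owns a pet": 28, "No pet": 22},
}

def row_proportion(table, row, column):
    """Proportion of one row that falls in the given column."""
    row_total = sum(table[row].values())
    return table[row][column] / row_total

def overall_proportion(table, column):
    """Proportion of everyone surveyed that falls in the given column."""
    grand_total = sum(sum(row.values()) for row in table.values())
    column_total = sum(row[column] for row in table.values())
    return column_total / grand_total

p_pet_female = row_proportion(table, "Female", "Owns a pet")  # 32 out of 50
p_pet_male = row_proportion(table, "Male", "Owns a pet")      # 28 out of 50
p_pet = overall_proportion(table, "Owns a pet")               # 60 out of 100
print(p_pet_female, p_pet_male, p_pet)
```

Comparing the two row proportions (rather than the raw counts) is what reveals an association. Here the counts happen to be directly comparable only because both groups contain 50 students.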
5. Analysing statistical claims
When evaluating a statistical claim, consider:
- Sample size: is it large enough to be reliable?
- Sampling method: is it random and representative, or biased?
- Measures used: does the claim use mean, median, or mode? Which is most appropriate?
- Visualisation tricks: are axes truncated or scales distorted?
- Causation vs correlation: does the claim imply cause when only association is shown?
A company claims “9 out of 10 dentists recommend our toothpaste.” What questions should you ask?
- How were the dentists selected? (If they were paid by the company, the sample is biased.)
- What was the exact question? (“Do you recommend brushing teeth?” is different from “Do you recommend this specific brand?”)
- How large was the sample? ($10$ dentists is too small to generalise.)
- Were dentists who disagreed excluded from the report?
Practice
Tier 1: basic skills
- Find the five-number summary for: .
- Calculate the IQR for the data in Q1.
- A data set has , . Find the upper and lower fences for outlier detection.
- The five-number summary for a data set is: . Sketch a boxplot.
- A boxplot has its median closer to $Q_1$ than to $Q_3$. Is the distribution positively or negatively skewed?
- In a two-way table, out of people surveyed are left-handed. What proportion is left-handed?
- A data set has and IQR . What is ?
- True or false: the median always lies exactly in the centre of the box in a boxplot.
Tier 2: mixed practice
- The heights (cm) of $15$ students are: . Find the five-number summary, identify any outliers, and sketch a boxplot.
- Two classes have the following five-number summaries for a maths test (out of ):
  - Class X: .
  - Class Y: .

  Draw parallel boxplots and write two comparison statements.
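- A two-way table shows transport mode and year level:

  | | Bus | Car | Walk | Total |
  |---|---|---|---|---|
  | Year 9 | 30 | 20 | 10 | 60 |
  | Year 10 | 15 | 35 | 10 | 60 |
  | Total | 45 | 55 | 20 | 120 |

  Find $P(\text{Bus} \mid \text{Year 9})$ and $P(\text{Bus} \mid \text{Year 10})$. What do you notice?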
- A newspaper reports “Average house prices rose by .” Explain why the median might be a better measure than the mean for house prices, and how a few expensive sales could distort the mean.
- A data set has values: . Show that is an outlier using the $1.5 \times \text{IQR}$ rule.
Tier 3: explain and apply
- A study claims students who eat breakfast score higher on tests. The data shows a correlation. Explain why this does not prove causation and suggest a confounding variable.
- Two factories produce bolts. Factory A: median length mm, IQR mm. Factory B: median length mm, IQR mm. Which factory produces more consistent bolts? Which is closer to the target of mm? Discuss trade-offs.
- A survey of people finds that support a new policy. The survey was conducted online and only advertised on one social media platform. Identify two sources of potential bias and explain how each could affect the results.
- Explain the difference between the range and the IQR as measures of spread. Give an example where the range is misleading but the IQR is not.
Challenge
Harder reasoning
- A data set of $20$ values has , median , . If the value is added to the data set, explain qualitatively how each part of the five-number summary might change and whether would be classified as an outlier.
- Two data sets both have median and IQR , but one is symmetric and the other is positively skewed. Sketch boxplots for both and explain how the whisker lengths differ.
- A researcher collects data from people and presents a boxplot showing no outliers. A critic argues that with data points, some outliers are expected. Evaluate this argument.
- Design a two-way table for students that shows an association between “plays sport” and “gets more than hours of sleep.” Then modify it so there is no association. Explain the difference.
Answer key
Attempt the practice first, then check your answers below.
Tier 1
- Min , (median of : average of and ), median (5th value), (median of : average of and ), max .
- IQR .
- IQR . Lower fence . Upper fence .
- Boxplot with whisker at , box from to , median line at , whisker to .
- Positively skewed (the data is more spread out above the median than below).
- or .
- .
- False. The median is only centred if the distribution is symmetric. In a skewed distribution, the median is closer to one quartile.
Tier 2
- Five-number summary: min , (median of positions 1 to 7), median (8th value), (median of positions 9 to 15), max . IQR . Upper fence . Lower fence . Both and are below , so there are no outliers.
- Class X has a higher median ( vs ) and a smaller IQR ( vs ). Class X performed better overall and more consistently. Class Y has a higher maximum () and also a higher minimum ( vs ), so both classes have similar ranges.
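- $P(\text{Bus} \mid \text{Year 9}) = \frac{30}{60} = 0.5$. $P(\text{Bus} \mid \text{Year 10}) = \frac{15}{60} = 0.25$. Year 9 students are twice as likely to catch the bus as Year 10 students.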
- House prices are often positively skewed: most houses cluster around a typical value, but a few very expensive properties pull the mean upward. The median is resistant to extreme values and better represents the “typical” house price. A few multi-million-dollar sales can raise the mean significantly without affecting most buyers’ experience.
- Ordered: . . . IQR . Upper fence . Since , the value is an outlier.
Tier 3
- Correlation does not prove causation because a third variable could explain both. For example, students from families with higher socioeconomic status may be more likely to eat breakfast and have access to tutoring, quiet study spaces, and parental support. The breakfast itself may not cause higher scores; the underlying variable (family resources) may drive both outcomes.
- Factory A is more consistent (IQR mm vs mm). Factory B has a median closer to the target of mm. Trade-off: Factory A produces bolts of very uniform length but slightly above target; Factory B hits the target on average but with much greater variability. If precision matters (e.g. safety-critical components), Factory A is preferable despite the slight offset, which could be corrected by recalibrating.
- Sources of bias: (i) Self-selection bias — only people who chose to respond are counted; those with strong opinions may be overrepresented. (ii) Platform bias — users of that particular social media platform may not be representative of the general population (e.g. younger demographic, specific political leanings). Both could overestimate or underestimate true support depending on the platform’s user base.
- Range uses only the two most extreme values, so a single outlier can make the range very large. IQR uses the middle $50\%$ of the data and is resistant to outliers. Example: . Range (misleadingly large). IQR (reflects the actual spread of most data).
Challenge
- Adding : the minimum stays at (or whatever it was), the maximum becomes . IQR . Upper fence . Since , yes, is an outlier. The median may shift slightly upward (from the average of the 10th and 11th values to the 11th value of the new 21-value set). and may shift slightly but the effect is small.
- Symmetric: both whiskers are approximately equal length, extending evenly from the box. Positively skewed: the right whisker is much longer than the left; data extends further above $Q_3$ than below $Q_1$. Both have the same box size (IQR ) and median (), but the skewed version has the median closer to $Q_1$.
- The argument has some merit: in a normal distribution, about of values lie beyond or , so we might expect roughly — outliers. However, if the data is truly free of measurement errors and follows a tight distribution, it is possible (though unlikely) to have no outliers. The researcher should report the distribution shape and explain why outliers are absent.
- With association: Sport-yes/Sleep-yes , Sport-yes/Sleep-no , Sport-no/Sleep-yes , Sport-no/Sleep-no . , . These differ, showing an association. No association: Sport-yes/Sleep-yes , Sport-yes/Sleep-no , Sport-no/Sleep-yes , Sport-no/Sleep-no (using whole numbers: ). Now , so the variables are approximately independent.