Topic 14 | Statistics & Probability

Boxplots and distributions

Year 10 core: five-number summary, constructing and interpreting boxplots, IQR and outlier detection, comparing distributions, two-way tables for categorical data, and analysing statistical claims.

50-65 min | Printable practice | Answer key | Challenge included
How to use this page

Read the explanation, work through the examples, then complete the core practice before printing.

What you will learn

Worked example 0 Real-world example: comparing test scores

Two classes sit the same test. Class A scores: $42, 55, 60, 63, 65, 68, 70, 72, 78, 85$. Class B scores: $35, 50, 52, 58, 60, 62, 65, 80, 82, 95$.

  1. Class A five-number summary: min $= 42$, $Q_1 = 60$, median $= 66.5$, $Q_3 = 72$, max $= 85$.
  2. Class B five-number summary: min $= 35$, $Q_1 = 52$, median $= 61$, $Q_3 = 80$, max $= 95$.
  3. Class A has a higher median and smaller IQR ($72 - 60 = 12$) vs Class B ($80 - 52 = 28$).
  4. Class A performed more consistently; Class B had wider variation.

Key idea: the boxplot reveals both the typical score (median) and how spread out the scores are (IQR).

1. The five-number summary

The five-number summary consists of:

Statistic              | Meaning
-----------------------|-------------------------
Minimum                | Smallest value
$Q_1$ (lower quartile) | Median of the lower half
Median ($Q_2$)         | Middle value
$Q_3$ (upper quartile) | Median of the upper half
Maximum                | Largest value

The interquartile range (IQR) measures the spread of the middle $50\%$ of data:

Interquartile range

$$\text{IQR} = Q_3 - Q_1$$

Worked example 1 Finding the five-number summary

Data (already ordered): $12, 15, 18, 20, 22, 25, 27, 30, 35, 40, 42, 50$.

There are $12$ values.

  1. Minimum $= 12$, Maximum $= 50$.
  2. Median $= \dfrac{25 + 27}{2} = 26$ (average of the 6th and 7th values).
  3. Lower half: $12, 15, 18, 20, 22, 25$. $Q_1 = \dfrac{18 + 20}{2} = 19$.
  4. Upper half: $27, 30, 35, 40, 42, 50$. $Q_3 = \dfrac{35 + 40}{2} = 37.5$.
  5. IQR $= 37.5 - 19 = 18.5$.
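
The steps in this worked example can be checked with a short Python sketch. It assumes the median-of-halves method used on this page (when the number of values is odd, the middle value is left out of both halves); calculators and spreadsheets may use a different quartile rule and give slightly different answers. The function name `five_number_summary` is just an illustrative choice.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # skips the middle value when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Data from Worked example 1
data = [12, 15, 18, 20, 22, 25, 27, 30, 35, 40, 42, 50]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)  # 12 19.0 26.0 37.5 50 18.5
```

The printed values match the hand calculation: $Q_1 = 19$, median $= 26$, $Q_3 = 37.5$ and IQR $= 18.5$.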

2. Constructing and interpreting boxplots

A boxplot displays the five-number summary graphically:

[Figure: labelled boxplot showing the five-number summary on a scale from 10 to 60, with Min $= 12$, $Q_1 = 19$, Median $= 26$, $Q_3 = 37.5$, Max $= 50$.]

3. Identifying outliers

An outlier is a value that is unusually far from the rest of the data. The standard rule:

Outlier boundaries

Lower fence

$$\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}$$

Upper fence

$$\text{Upper fence} = Q_3 + 1.5 \times \text{IQR}$$

Any data value below the lower fence or above the upper fence is classified as an outlier.

Worked example 2 Detecting an outlier

A data set has $Q_1 = 19$, $Q_3 = 37.5$, IQR $= 18.5$. The maximum value is $80$. Is $80$ an outlier?

  1. Upper fence $= 37.5 + 1.5 \times 18.5 = 37.5 + 27.75 = 65.25$.
  2. Since $80 > 65.25$, the value $80$ is an outlier.
  3. On a boxplot, the upper whisker would stop at the largest value that is not an outlier (no higher than $65.25$), and $80$ would be plotted as an individual dot.
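
As a rough check of the fence calculation, here is a minimal Python sketch of the $1.5 \times \text{IQR}$ rule. The function names are illustrative, and the short list passed to `find_outliers` is made up for the demonstration: only the value $80$ and the quartiles come from the worked example.

```python
def outlier_fences(q1, q3):
    """Return (lower_fence, upper_fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

def find_outliers(values, q1, q3):
    """Return the values lying strictly outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Quartiles from Worked example 2
lower, upper = outlier_fences(19, 37.5)
print(lower, upper)  # -8.75 65.25

# Illustrative values only; 80 is the maximum from the example
print(find_outliers([12, 26, 50, 80], 19, 37.5))  # [80]
```

A value exactly on a fence is not flagged here, which matches the wording "below the lower fence or above the upper fence".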

4. Comparing distributions and two-way tables

Parallel boxplots (drawn on the same scale) allow direct comparison of centre, spread, and shape.

When comparing, comment on centre (the medians), spread (the IQRs and ranges) and shape (symmetry or skew).

A two-way table organises categorical data by two variables. It shows frequencies and can reveal associations.

Worked example 3 Two-way table

A survey of $100$ students asks about pet ownership and gender.

       | Owns a pet | No pet | Total
-------|------------|--------|------
Female | 32         | 18     | 50
Male   | 28         | 22     | 50
Total  | 60         | 40     | 100

  1. $P(\text{owns a pet}) = \dfrac{60}{100} = 0.6$.
  2. $P(\text{owns a pet} \mid \text{female}) = \dfrac{32}{50} = 0.64$.
  3. $P(\text{owns a pet} \mid \text{male}) = \dfrac{28}{50} = 0.56$.
  4. Females are slightly more likely to own a pet in this sample, but the difference is small.
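
The same proportions can be read off the table programmatically. The sketch below stores the survey counts in a nested dictionary, which is just one convenient layout; the function names `conditional` and `marginal` are illustrative.

```python
# Counts from the survey table: rows are gender, columns are pet ownership.
table = {
    "female": {"pet": 32, "no pet": 18},
    "male": {"pet": 28, "no pet": 22},
}

def conditional(table, row, column):
    """P(column | row): the row's count in that column divided by the row total."""
    return table[row][column] / sum(table[row].values())

def marginal(table, column):
    """P(column): the column total divided by the grand total."""
    column_total = sum(counts[column] for counts in table.values())
    grand_total = sum(sum(counts.values()) for counts in table.values())
    return column_total / grand_total

print(marginal(table, "pet"))               # 0.6
print(conditional(table, "female", "pet"))  # 0.64
print(conditional(table, "male", "pet"))    # 0.56
```

The key design point is the denominator: a conditional probability divides by the row total ($50$), while the overall probability divides by the grand total ($100$).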

5. Analysing statistical claims

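When evaluating a statistical claim, consider how the sample was selected, how large it was, how the question was worded, and whether any results were left out.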

Worked example 4 Spotting bias

A company claims “9 out of 10 dentists recommend our toothpaste.” What questions should you ask?

  1. How were the dentists selected? (If they were paid by the company, the sample is biased.)
  2. What was the exact question? (“Do you recommend brushing teeth?” is different from “Do you recommend this specific brand?”)
  3. How large was the sample? (A sample of only $10$ dentists would be too small to generalise.)
  4. Were dentists who disagreed excluded from the report?

Practice

Fluency

Tier 1: basic skills

    1. Find the five-number summary for: $5, 8, 12, 15, 18, 20, 22, 25, 30$.
    2. Calculate the IQR for the data in question 1.
    3. A data set has $Q_1 = 10$, $Q_3 = 30$. Find the upper and lower fences for outlier detection.
    4. The five-number summary for a data set is: $2, 8, 14, 20, 28$. Sketch a boxplot.
    5. A boxplot has its median closer to $Q_1$ than to $Q_3$. Is the distribution positively or negatively skewed?
    6. In a two-way table, $40$ out of $100$ people surveyed are left-handed. What proportion is left-handed?
    7. A data set has $Q_1 = 25$ and IQR $= 12$. What is $Q_3$?
    8. True or false: the median always lies exactly in the centre of the box in a boxplot.
Reasoning

Tier 2: mixed practice

    1. The heights (cm) of $15$ students are: $152, 155, 158, 160, 162, 164, 165, 167, 170, 172, 175, 178, 180, 195, 198$. Find the five-number summary, identify any outliers, and sketch a boxplot.

    2. Two classes have the following five-number summaries for a maths test (out of $50$):

      • Class X: $15, 28, 35, 40, 48$.
      • Class Y: $20, 25, 30, 42, 50$.

      Draw parallel boxplots and write two comparison statements.

    3. A two-way table shows transport mode and year level:

              | Bus | Car | Walk | Total
      --------|-----|-----|------|------
      Year 9  | 30  | 20  | 10   | 60
      Year 10 | 15  | 35  | 10   | 60
      Total   | 45  | 55  | 20   | 120

      Find $P(\text{bus} \mid \text{Year 9})$ and $P(\text{bus} \mid \text{Year 10})$. What do you notice?

    4. A newspaper reports “Average house prices rose by $20\%$.” Explain why the median might be a better measure than the mean for house prices, and how a few expensive sales could distort the mean.

    5. A data set has values: $3, 5, 7, 8, 10, 12, 14, 15, 50$. Show that $50$ is an outlier using the $1.5 \times \text{IQR}$ rule.

Reasoning

Tier 3: explain and apply

    1. A study claims students who eat breakfast score higher on tests. The data shows a correlation. Explain why this does not prove causation and suggest a confounding variable.
    2. Two factories produce bolts. Factory A: median length $50.2$ mm, IQR $= 0.8$ mm. Factory B: median length $50.0$ mm, IQR $= 2.5$ mm. Which factory produces more consistent bolts? Which is closer to the target of $50.0$ mm? Discuss trade-offs.
    3. A survey of $200$ people finds that $60\%$ support a new policy. The survey was conducted online and only advertised on one social media platform. Identify two sources of potential bias and explain how each could affect the results.
    4. Explain the difference between the range and the IQR as measures of spread. Give an example where the range is misleading but the IQR is not.

Challenge

Reasoning

Harder reasoning

    1. A data set of $20$ values has $Q_1 = 15$, median $= 22$, $Q_3 = 30$. If the value $60$ is added to the data set, explain qualitatively how each part of the five-number summary might change and whether $60$ would be classified as an outlier.
    2. Two data sets both have median $= 50$ and IQR $= 10$, but one is symmetric and the other is positively skewed. Sketch boxplots for both and explain how the whisker lengths differ.
    3. A researcher collects data from $500$ people and presents a boxplot showing no outliers. A critic argues that with $500$ data points, some outliers are expected. Evaluate this argument.
    4. Design a two-way table for $80$ students that shows an association between “plays sport” and “gets more than $8$ hours of sleep.” Then modify it so there is no association. Explain the difference.
Answers

Answer key

Attempt the practice first, then check your answers below.

Tier 1

    1. Min $= 5$, $Q_1 = 10$ (median of $5, 8, 12, 15$: average of $8$ and $12$), median $= 18$ (5th value), $Q_3 = 23.5$ (median of $20, 22, 25, 30$: average of $22$ and $25$), max $= 30$.
    2. IQR $= Q_3 - Q_1 = 23.5 - 10 = 13.5$.
    3. IQR $= 30 - 10 = 20$. Lower fence $= 10 - 1.5 \times 20 = 10 - 30 = -20$. Upper fence $= 30 + 1.5 \times 20 = 30 + 30 = 60$.
    4. Boxplot with a whisker from $2$, box from $8$ to $20$, median line at $14$, and a whisker to $28$.
    5. Positively skewed (the data is more spread out above the median than below).
    6. $\dfrac{40}{100} = 0.4$ or $40\%$.
    7. $Q_3 = Q_1 + \text{IQR} = 25 + 12 = 37$.
    8. False. The median is only centred if the distribution is symmetric. In a skewed distribution, the median is closer to one quartile.

Tier 2

    1. Five-number summary: min $= 152$, $Q_1 = 160$ (median of positions 1 to 7), median $= 167$ (8th value), $Q_3 = 178$ (median of positions 9 to 15), max $= 198$. IQR $= 178 - 160 = 18$. Upper fence $= 178 + 1.5 \times 18 = 178 + 27 = 205$. Lower fence $= 160 - 27 = 133$. Both $195$ and $198$ are below $205$, so there are no outliers.
    2. Class X has a higher median ($35$ vs $30$) and a smaller IQR ($40 - 28 = 12$ vs $42 - 25 = 17$), so Class X performed better overall and more consistently in its middle $50\%$. Class Y has both a higher maximum ($50$ vs $48$) and a higher minimum ($20$ vs $15$). The ranges are similar ($48 - 15 = 33$ for Class X and $50 - 20 = 30$ for Class Y).
    3. $P(\text{bus} \mid \text{Year 9}) = \dfrac{30}{60} = 0.5$. $P(\text{bus} \mid \text{Year 10}) = \dfrac{15}{60} = 0.25$. Year 9 students are twice as likely to catch the bus as Year 10 students.
    4. House prices are often positively skewed: most houses cluster around a typical value, but a few very expensive properties pull the mean upward. The median is resistant to extreme values and better represents the “typical” house price. A few multi-million-dollar sales can raise the mean significantly without affecting most buyers’ experience.
    5. Ordered: $3, 5, 7, 8, 10, 12, 14, 15, 50$. $Q_1 = \dfrac{5 + 7}{2} = 6$. $Q_3 = \dfrac{14 + 15}{2} = 14.5$. IQR $= 14.5 - 6 = 8.5$. Upper fence $= 14.5 + 1.5 \times 8.5 = 14.5 + 12.75 = 27.25$. Since $50 > 27.25$, the value $50$ is an outlier.

Tier 3

    1. Correlation does not prove causation because a third variable could explain both. For example, students from families with higher socioeconomic status may be more likely to eat breakfast and have access to tutoring, quiet study spaces, and parental support. The breakfast itself may not cause higher scores; the underlying variable (family resources) may drive both outcomes.
    2. Factory A is more consistent (IQR $= 0.8$ mm vs $2.5$ mm). Factory B has a median closer to the target of $50.0$ mm. Trade-off: Factory A produces bolts of very uniform length but slightly above target; Factory B hits the target on average but with much greater variability. If precision matters (e.g. safety-critical components), Factory A is preferable despite the slight offset, which could be corrected by recalibrating.
    3. Sources of bias: (i) Self-selection bias: only people who chose to respond are counted; those with strong opinions may be overrepresented. (ii) Platform bias: users of that particular social media platform may not be representative of the general population (e.g. younger demographic, specific political leanings). Both could overestimate or underestimate true support depending on the platform’s user base.
    4. Range uses only the two most extreme values, so a single outlier can make the range very large. IQR uses the middle $50\%$ and is resistant to outliers. Example: $\{10, 12, 14, 15, 16, 18, 100\}$. Range $= 100 - 10 = 90$ (misleadingly large). IQR $= 18 - 12 = 6$ (reflects the actual spread of most data).

Challenge

    1. Adding $60$: the minimum is unchanged, and the maximum becomes $60$ unless the data already contains a larger value. The median may shift slightly upward (from the average of the 10th and 11th values to the 11th value of the new 21-value set). Because $60$ lies above the median, $Q_1$ is unchanged under the median-of-halves method and $Q_3$ may rise slightly, but the effect is small. Using the original quartiles as a guide, IQR $= 30 - 15 = 15$ and the upper fence $= 30 + 1.5 \times 15 = 52.5$. Since $60 > 52.5$, $60$ would be classified as an outlier.
    2. Symmetric: both whiskers are approximately equal length, extending evenly from the box. Positively skewed: the right whisker is much longer than the left; data extends further above $Q_3$ than below $Q_1$. Both have the same box size (IQR $= 10$) and median ($50$), but the skewed version has the median closer to $Q_1$.
    3. The argument has some merit: in a normal distribution, about $0.7\%$ of values lie beyond $Q_1 - 1.5 \times \text{IQR}$ or $Q_3 + 1.5 \times \text{IQR}$, so we might expect roughly $0.007 \times 500 \approx 3$ to $4$ outliers. However, if the data is truly free of measurement errors and follows a tight distribution, it is possible (though unlikely) to have no outliers. The researcher should report the distribution shape and explain why outliers are absent.
    4. With association: Sport-yes/Sleep-yes $= 30$, Sport-yes/Sleep-no $= 10$, Sport-no/Sleep-yes $= 15$, Sport-no/Sleep-no $= 25$. $P(\text{sleep} \mid \text{sport}) = \dfrac{30}{40} = 0.75$, $P(\text{sleep} \mid \text{no sport}) = \dfrac{15}{40} = 0.375$. These differ, showing an association. No association: keep the same totals ($40$ in each sport group, $45$ sleep-yes, $35$ sleep-no) and make each cell equal to row total $\times$ column total $\div$ grand total, e.g. $\dfrac{40 \times 45}{80} = 22.5$. This gives Sport-yes/Sleep-yes $= 22.5$, Sport-yes/Sleep-no $= 17.5$, Sport-no/Sleep-yes $= 22.5$, Sport-no/Sleep-no $= 17.5$ (using whole numbers: $23, 17, 22, 18$). Now $P(\text{sleep} \mid \text{sport}) = \dfrac{23}{40} = 0.575$ and $P(\text{sleep} \mid \text{no sport}) = \dfrac{22}{40} = 0.55$, both close to $\dfrac{45}{80} = 0.5625$, so the variables are approximately independent.
