Data analysis and distributions - Year 9 Mathematics

What you will learn

construct and interpret back-to-back stem-and-leaf plots and comparative histograms,
describe the shape of a distribution: symmetric, positively skewed, negatively skewed, or bimodal,
explain the effect of outliers on the mean, median, and range,
identify and compare sampling methods (random, systematic, stratified, convenience) and recognise bias,
choose appropriate data displays for different data types,
plan and conduct a statistical investigation.

Worked example 0 Real-world example: comparing test scores

Two classes sit the same maths test. Class A scores: $52, 58, 61, 63, 65, 67, 68, 70, 72, 74$. Class B scores: $40, 55, 60, 62, 64, 66, 68, 70, 85, 90$.

Class A mean $= \dfrac{650}{10} = 65$. Class B mean $= \dfrac{660}{10} = 66$.
Class A median $= \dfrac{65 + 67}{2} = 66$. Class B median $= \dfrac{64 + 66}{2} = 65$.
Class A range $= 74 - 52 = 22$. Class B range $= 90 - 40 = 50$.
The means are nearly equal, but Class B has much greater spread and potential outliers ($40$ and $90$).

Key idea: similar centres can mask very different spreads — always look at both.
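The calculations above can be checked with a short Python sketch. It assumes Python's standard `statistics` module, and the helper name `summarise` is made up for this illustration.

```python
from statistics import mean, median

class_a = [52, 58, 61, 63, 65, 67, 68, 70, 72, 74]
class_b = [40, 55, 60, 62, 64, 66, 68, 70, 85, 90]

def summarise(scores):
    """Return the mean, median and range of a list of scores."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "range": max(scores) - min(scores),
    }

summary_a = summarise(class_a)
summary_b = summarise(class_b)
print("Class A:", summary_a)
print("Class B:", summary_b)
```

Putting the centre and the spread in one summary makes it harder to compare the means and overlook the difference in range.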

1. Back-to-back stem-and-leaf plots

A back-to-back stem-and-leaf plot displays two data sets sharing a common stem. The leaves for one group extend to the left, and the leaves for the other extend to the right.

Back-to-back stem-and-leaf plot comparing pulse rates (beats per minute) for a fitness group and a control group.

Reading the plot: the fitness group’s pulse rates cluster in the 60s, while the control group’s data is more spread out and shifted higher.

Worked example 1 Constructing a back-to-back stem-and-leaf plot

Group X times (seconds): $23, 25, 28, 31, 34, 35, 37, 42, 45$ . Group Y times: $20, 22, 26, 29, 30, 33, 38, 40, 41, 48$ .

Stems are the tens digits: $2, 3, 4$.
Write Group X leaves to the left, increasing outward from the stem (so they read in descending order from left to right), and Group Y leaves to the right in ascending order.

Group X	Stem	Group Y
8 5 3	2	0 2 6 9
7 5 4 1	3	0 3 8
5 2	4	0 1 8

Group X median $= 34$, Group Y median $= 31.5$. Group X's typical time is higher, so Group X is slightly slower.
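One way to build the rows of this plot is sketched below in Python. The function name `back_to_back` is hypothetical, and the sketch assumes non-negative whole numbers whose stems are the tens digits.

```python
from collections import defaultdict

def back_to_back(left, right):
    """Return the rows of a back-to-back stem-and-leaf plot."""
    left_leaves = defaultdict(list)
    right_leaves = defaultdict(list)
    for value in left:
        left_leaves[value // 10].append(value % 10)
    for value in right:
        right_leaves[value // 10].append(value % 10)

    combined = left + right
    rows = []
    for stem in range(min(combined) // 10, max(combined) // 10 + 1):
        # Left leaves grow outward from the stem, so they are written
        # in descending order when read from left to right.
        left_text = " ".join(str(d) for d in sorted(left_leaves[stem], reverse=True))
        right_text = " ".join(str(d) for d in sorted(right_leaves[stem]))
        rows.append((left_text, stem, right_text))
    return rows

group_x = [23, 25, 28, 31, 34, 35, 37, 42, 45]
group_y = [20, 22, 26, 29, 30, 33, 38, 40, 41, 48]

rows = back_to_back(group_x, group_y)
for left_text, stem, right_text in rows:
    print(f"{left_text:>10} | {stem} | {right_text}")
```

The rows it produces match the table in this worked example.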

2. Shape of distributions

When you look at a histogram or stem-and-leaf plot, describe its shape:

Symmetric: roughly the same on both sides of the centre (mean $\approx$ median).
Positively skewed (right-skewed): a long tail to the right. Most data is on the left. Mean $>$ median.
Negatively skewed (left-skewed): a long tail to the left. Most data is on the right. Mean $<$ median.
Bimodal: two distinct peaks, suggesting two sub-groups in the data.

Worked example 2 Identifying shape

A histogram of house prices in a suburb shows many homes in the \$400,000–\$600,000 range, fewer in the \$600,000–\$800,000 range, and a small number above \$1,000,000.

The bulk of the data is on the left.
A long tail extends to the right (high-priced homes).
The distribution is positively skewed.
The mean will be pulled higher than the median by the expensive homes, so the median is a better measure of centre for this data.
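The mean-versus-median rule can be tried on a small data set. The prices below (in thousands of dollars) are made up for illustration and are not the suburb's data. The function `describe_skew` and its `tolerance` cut-off are assumptions of this sketch. It is only a rule of thumb: it cannot detect a bimodal distribution, so always look at the display as well.

```python
from statistics import mean, median

def describe_skew(data, tolerance=0.05):
    """Rule of thumb only: compare the mean with the median.

    `tolerance` is an arbitrary cut-off (a fraction of the range)
    chosen for this illustration.
    """
    gap = mean(data) - median(data)
    spread = max(data) - min(data)
    if abs(gap) <= tolerance * spread:
        return "roughly symmetric"
    return "positively skewed" if gap > 0 else "negatively skewed"

# Hypothetical house prices in thousands of dollars.
prices = [420, 450, 480, 510, 530, 560, 590, 650, 720, 1200]

print(mean(prices), median(prices), describe_skew(prices))
```

Here the single expensive home lifts the mean to $611$ while the median stays at $545$, so the rule reports a positive skew.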

3. Effect of outliers

An outlier is a data value that lies well outside the main body of the data.

Mean: strongly affected — one extreme value can pull the mean significantly.
Median: resistant — it depends only on the middle value(s), so one outlier barely changes it.
Range: strongly affected — it uses only the maximum and minimum.

Worked example 3 Outlier impact

Data set: $12, 14, 15, 15, 16, 17, 18, 50$.

Mean $= \dfrac{157}{8} = 19.625$. Without the outlier $50$: mean $= \dfrac{107}{7} \approx 15.3$.
Median $= \dfrac{15 + 16}{2} = 15.5$. Without $50$: median $= 15$. Barely changed.
Range $= 50 - 12 = 38$. Without $50$: range $= 18 - 12 = 6$. Dramatically reduced.

When outliers are present, the median better represents the typical value.
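The comparison in this worked example can be reproduced with a few lines of Python. This is a sketch using the standard `statistics` module, and the helper `value_range` is a name invented here.

```python
from statistics import mean, median

data = [12, 14, 15, 15, 16, 17, 18, 50]
without_outlier = [x for x in data if x != 50]

def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

for label, values in [("with 50", data), ("without 50", without_outlier)]:
    print(label, round(mean(values), 3), median(values), value_range(values))
```

Running both versions side by side shows the mean and range moving a long way while the median shifts by only $0.5$.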

4. Sampling methods and bias

When the population is too large to survey entirely, we take a sample. The method of sampling affects the reliability of conclusions.

Method	Description	Strengths	Weaknesses
Simple random	Every member has an equal chance of selection	Unbiased, representative	Needs a complete list of the population
Systematic	Select every $k$-th member from a list	Easy to implement	Can give a biased sample if the list has a hidden cycle that lines up with $k$
Stratified	Divide into subgroups (strata), sample proportionally from each	Ensures all subgroups are represented	Requires knowledge of subgroup sizes
Convenience	Choose whoever is easiest to reach	Quick and cheap	Often biased — not representative
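The first three methods in the table can be sketched in Python with the standard `random` module. The school roll, the student IDs, and the function names are all hypothetical. The stratified version assumes the proportional shares come out as whole numbers, so a real investigation would need a rounding rule.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n, rng):
    """Sample from each stratum in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = n * len(members) // total  # assumes a whole-number share
        sample[name] = rng.sample(members, share)
    return sample

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical school roll: student IDs grouped by year level.
strata = {
    "Year 7": [f"Y7-{i}" for i in range(90)],
    "Year 8": [f"Y8-{i}" for i in range(60)],
    "Year 9": [f"Y9-{i}" for i in range(50)],
}
roll = [student for members in strata.values() for student in members]

print(len(simple_random_sample(roll, 20, rng)))
print(len(systematic_sample(roll, 10, rng)))
print({name: len(chosen) for name, chosen in stratified_sample(strata, 20, rng).items()})
```

A convenience sample is left out on purpose: it has no selection rule to code, which is exactly why it tends to be biased.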

Worked example 4 Identifying bias

A school surveys students about favourite sports by asking only those at basketball training.

This is a convenience sample: it includes only students who are already interested in basketball.
Basketball is likely to be overrepresented and other sports underrepresented.
A better approach: take a stratified random sample from each year level to capture the full school population.

5. Choosing displays and planning investigations

Different data types suit different displays:

Categorical data: bar chart, pie chart.
Numerical (discrete): dot plot, bar chart.
Numerical (continuous): histogram, stem-and-leaf plot, box plot.
Comparing two groups: back-to-back stem-and-leaf, side-by-side box plots, comparative histograms.

A well-planned statistical investigation follows these steps:

Pose a question that can be answered with data.
Plan data collection — choose sampling method, sample size, and variables.
Collect data systematically.
Analyse — calculate summary statistics, construct appropriate displays.
Conclude — interpret results, acknowledge limitations.

Practice

Fluency

Tier 1: basic skills

Classify each distribution shape: (a) tail on the right, (b) two peaks, (c) mirror-image shape, (d) tail on the left.
Data: $3, 5, 6, 7, 7, 8, 9, 40$ . Find the mean, median, and range.
Remove the outlier from Q2 and recalculate mean, median, and range. Which statistic changed most?
For the data in Q2, which measure of centre better represents the typical value? Explain.
A sample is taken by selecting every 10th student on a school roll. Name this sampling method.
A survey asks 50 people at a train station about their preferred mode of transport. Explain why this sample might be biased.
Construct a stem-and-leaf plot for: $14, 18, 22, 25, 27, 31, 33, 36, 38, 42$ .
What type of display would you use to compare the heights of Year 9 boys and Year 9 girls?
State whether the mean or median is higher for a positively skewed distribution.
A histogram has bars of heights $2, 5, 8, 6, 3, 1$ . Describe the shape of this distribution.

Reasoning

Tier 2: mixed practice

Two classes recorded the number of books read last term. Class A: $2, 3, 3, 4, 5, 5, 6, 7, 8, 12$ . Class B: $1, 2, 4, 5, 5, 6, 6, 7, 7, 8$ . Construct a back-to-back stem-and-leaf plot and compare the distributions.
A data set has mean $24$ and median $18$ . Is the distribution likely symmetric, positively skewed, or negatively skewed? Explain.
A researcher wants to survey $200$ out of $2000$ students about study habits. The school has $800$ Year 7, $700$ Year 8, and $500$ Year 9 students. Calculate how many students should be sampled from each year level using stratified sampling.
Explain why the median is preferred over the mean when reporting typical house prices.
A factory records the time (in seconds) to assemble a part. Morning shift: $42, 44, 45, 46, 47, 48, 50$ . Afternoon shift: $43, 45, 46, 48, 50, 52, 58$ . Compare using mean, median, and range.
Describe a situation where a bimodal distribution would be expected. Explain what causes the two peaks.
A student claims: “My sample of 10 friends is representative of the whole school.” Critique this claim.
State three features you should always comment on when comparing two distributions.

Reasoning

Tier 3: explain and apply

A company reports that the “average salary” is \$95,000. The CEO earns \$800,000 and the other $19$ employees earn between \$50,000 and \$70,000 each. Explain how the company’s claim could be technically true but misleading.
Design a statistical investigation to determine whether Year 9 students spend more time on homework than Year 7 students. State the question, sampling method, variables, and how you would display the results.
Two histograms have the same mean and range, but different shapes. Sketch two possible histograms and explain how this is possible.
A data set of $20$ values has mean $30$ . An extra value of $80$ is added. Calculate the new mean and explain why the median might be a better summary.
Explain the difference between a population and a sample. Give an example where surveying the whole population is impractical.

Challenge

Reasoning

Harder reasoning

Two data sets each have $n$ values. Set A has mean $\bar{x}_A$ and set B has mean $\bar{x}_B$ . If the two sets are combined, show that the combined mean is $\dfrac{n\bar{x}_A + n\bar{x}_B}{2n}$ . What happens if the sets have different sizes $n_A$ and $n_B$ ?
A researcher adds a constant $c$ to every value in a data set. How does this affect (a) the mean, (b) the median, (c) the range, (d) the standard deviation? Justify each answer.
Construct a data set of $10$ values where the mean is $50$ , the median is $45$ , and the distribution is positively skewed. Verify your answer.
A school of $1200$ students is surveyed using stratified sampling by year level. Year 7: $350$, Year 8: $320$, Year 9: $280$, Year 10: $250$. If $120$ students are to be sampled, calculate the number from each year level. One Year 10 student in the sample has a recorded test score of $0$ because they were absent. Discuss how this outlier should be handled.

Answers

Answer key

Attempt the practice first, then check your work against the answers below.

Tier 1

(a) positively skewed, (b) bimodal, (c) symmetric, (d) negatively skewed.
Mean $= \dfrac{85}{8} = 10.625$ . Median $= \dfrac{7+7}{2} = 7$ . Range $= 40 - 3 = 37$ .
Without $40$: mean $= \dfrac{45}{7} \approx 6.43$, median $= 7$, range $= 9 - 3 = 6$. The range changed most (from $37$ to $6$), followed by the mean (from $10.625$ to about $6.43$). The median did not change.
The median ( $7$ ) better represents the typical value because the outlier ( $40$ ) inflates the mean.
Systematic sampling.
People at a train station are more likely to prefer trains, so train travel would be overrepresented. People who drive, cycle, or walk are less likely to be at the station.
Stem | Leaf: $1$ | $4\;8$ , $2$ | $2\;5\;7$ , $3$ | $1\;3\;6\;8$ , $4$ | $2$ .
A back-to-back stem-and-leaf plot or side-by-side box plots would both work well for comparing two numerical distributions.
The mean is higher than the median in a positively skewed distribution (the tail of high values pulls the mean up).
The bars rise to a single peak and then fall, so the distribution is unimodal and approximately symmetric. Three bars lie to the right of the peak and only two to the left, so a slight positive skew is also an acceptable description.

Tier 2

Back-to-back stem-and-leaf: Stem $0$ : Class A leaves $8\;7\;6\;5\;5\;4\;3\;3\;2$ | Class B leaves $1\;2\;4\;5\;5\;6\;6\;7\;7\;8$ . Stem $1$ : Class A leaf $2$ | Class B (none). Class A has a wider spread (range $10$ vs $7$ ) with an outlier at $12$ . Class B is more tightly clustered. Medians are similar (A: $5$ , B: $5.5$ ).
Positively skewed. The mean ( $24$ ) is greater than the median ( $18$ ), which indicates a tail of high values pulling the mean up.
Total $= 2000$ . Proportions: Year 7 $= \dfrac{800}{2000} \times 200 = 80$ , Year 8 $= \dfrac{700}{2000} \times 200 = 70$ , Year 9 $= \dfrac{500}{2000} \times 200 = 50$ .
House prices are typically positively skewed (a few very expensive houses push the mean up). The median gives a better sense of what a “typical” house costs because it is not affected by the extreme values.
Morning: mean $= 46$ , median $= 46$ , range $= 8$ . Afternoon: mean $\approx 48.9$ , median $= 48$ , range $= 15$ . The afternoon shift is slightly slower on average and has more variation, possibly due to the outlier at $58$ .
Example: heights of a mixed group of adult men and women. The two peaks correspond to the average female height and the average male height — two overlapping subpopulations create bimodality.
A sample of $10$ friends is a convenience sample that is not random. Friends tend to share interests, backgrounds, and demographics, so the sample is likely biased and not representative of the whole school. A random or stratified sample would be more reliable.
When comparing two distributions, comment on: (i) centre (mean or median), (ii) spread (range or IQR), and (iii) shape (symmetric, skewed, or bimodal). Also note any outliers.

Tier 3

The CEO’s salary of \$800,000 pulls the mean up. If the other $19$ earn an average of \$60,000, the total is $19 \times 60\,000 + 800\,000 = 1\,940\,000$, giving mean $= \dfrac{1\,940\,000}{20} = 97\,000$ dollars. The “average” (mean) is close to \$95,000 but the median is around \$60,000. Most employees earn far less than the reported average. The company uses the mean to create a misleading impression.
Question: “Do Year 9 students spend more time per week on homework than Year 7 students?” Sampling: stratified random sample of $30$ students from each year level. Variables: year level (categorical), homework hours per week (continuous). Display: side-by-side box plots or back-to-back stem-and-leaf plot. Calculate mean and median for each group and compare.
Example: Histogram 1 is symmetric (bell-shaped). Histogram 2 is bimodal with one peak below the mean and one above. Both can have the same mean (balanced around the centre) and the same range (same min and max) but very different shapes. The bimodal histogram has more data at the extremes and less near the centre.
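Original sum $= 20 \times 30 = 600$. New sum $= 600 + 80 = 680$. New mean $= \dfrac{680}{21} \approx 32.4$, an increase of about $2.4$ caused by a single value. The median changes from the average of the 10th and 11th values to the 11th value of the new ordered list. Provided $80$ lies above the middle of the data, this shifts the median by only half the gap between the original 10th and 11th values, so the median is more stable and better represents a typical value.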
A population is the entire group of interest; a sample is a subset selected for study. Example: surveying every one of Australia’s $\approx 26$ million residents about exercise habits is impractical due to cost and time. A representative sample of a few thousand provides useful estimates instead.

Challenge

Combined sum $= n\bar{x}_A + n\bar{x}_B$ . Combined count $= 2n$ . Combined mean $= \dfrac{n\bar{x}_A + n\bar{x}_B}{2n} = \dfrac{\bar{x}_A + \bar{x}_B}{2}$ . For different sizes: combined mean $= \dfrac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}$ , which is a weighted average of the two means.
(a) Mean increases by $c$ (every value increases by $c$, so the sum increases by $nc$, and the mean by $c$). (b) Median increases by $c$ (the middle value shifts by $c$). (c) Range is unchanged (max and min both increase by $c$, so their difference is the same). (d) Standard deviation is unchanged (each value and the mean both shift by $c$, so the deviations from the mean are the same).
One possible set: $30, 35, 38, 40, 44, 46, 50, 55, 62, 100$ . Sum $= 500$ , mean $= 50$ . Median $= \dfrac{44+46}{2} = 45$ . The high value $100$ creates a right tail, giving positive skew. Mean $>$ median, confirming positive skewness.
Proportions: Year 7 $= \dfrac{350}{1200} \times 120 = 35$ , Year 8 $= \dfrac{320}{1200} \times 120 = 32$ , Year 9 $= \dfrac{280}{1200} \times 120 = 28$ , Year 10 $= \dfrac{250}{1200} \times 120 = 25$ . The student who scored $0$ was absent, not genuinely scoring zero. This value should be treated as missing data and excluded from analysis (or the student should be resurveyed). Including it would unfairly lower Year 10’s statistics and misrepresent that year level’s performance.