Data analysis and distributions

What you will learn

construct and interpret back-to-back stem-and-leaf plots and comparative histograms,
describe the shape of a distribution: symmetric, positively skewed, negatively skewed, or bimodal,
explain the effect of outliers on the mean, median, and range,
identify and compare sampling methods (random, systematic, stratified, convenience) and recognise bias,
choose appropriate data displays for different data types,
plan and conduct a statistical investigation.

Worked example 0 Real-world example: comparing test scores

Two classes sit the same maths test. Class A scores: $52, 58, 61, 63, 65, 67, 68, 70, 72, 74$. Class B scores: $40, 55, 60, 62, 64, 66, 68, 70, 85, 90$.

Class A mean $= \dfrac{650}{10} = 65$. Class B mean $= \dfrac{660}{10} = 66$.
Class A median $= \dfrac{65 + 67}{2} = 66$. Class B median $= \dfrac{64 + 66}{2} = 65$.
Class A range $= 74 - 52 = 22$. Class B range $= 90 - 40 = 50$.
The means are nearly equal, but Class B has much greater spread and potential outliers ($40$ and $90$).

Key idea: similar centres can mask very different spreads — always look at both.
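The calculations above can be checked with a short Python sketch. It uses only the standard `statistics` module; the helper name `summarise` is our own choice for illustration, not part of any library.

```python
from statistics import mean, median

class_a = [52, 58, 61, 63, 65, 67, 68, 70, 72, 74]
class_b = [40, 55, 60, 62, 64, 66, 68, 70, 85, 90]

def summarise(scores):
    """Return (mean, median, range) for a list of numbers."""
    return mean(scores), median(scores), max(scores) - min(scores)

summary_a = summarise(class_a)
summary_b = summarise(class_b)

# Centres are close, but the ranges differ a lot.
print("Class A (mean, median, range):", summary_a)
print("Class B (mean, median, range):", summary_b)
```

Comparing the two tuples side by side shows the point of the example: the first two entries are almost the same, while the third is more than twice as large for Class B.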

1. Back-to-back stem-and-leaf plots

A back-to-back stem-and-leaf plot displays two data sets sharing a common stem. The leaves for one group extend to the left, and the leaves for the other extend to the right.

Figure: back-to-back stem-and-leaf plot comparing pulse rates (beats per minute) for a fitness group and a control group.

Reading the plot: the fitness group’s pulse rates cluster in the 60s, while the control group’s data is more spread out and shifted higher.

Worked example 1 Constructing a back-to-back stem-and-leaf plot

Group X times (seconds): $23, 25, 28, 31, 34, 35, 37, 42, 45$. Group Y times (seconds): $20, 22, 26, 29, 30, 33, 38, 40, 41, 48$.

Stems are the tens digits: $2, 3, 4$.
Write Group X leaves to the left, increasing away from the stem (so each row reads in descending order from left to right), and Group Y leaves to the right in ascending order.

Group X	Stem	Group Y
8 5 3	2	0 2 6 9
7 5 4 1	3	0 3 8
5 2	4	0 1 8

Group X median $= 34$, Group Y median $= 31.5$. Group X has the higher median time, so Group X is typically slightly slower.
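The construction steps in this example can also be sketched in Python. This is a minimal illustration that assumes non-negative whole numbers with the tens digit as the stem; the function name `back_to_back` is hypothetical.

```python
from collections import defaultdict

group_x = [23, 25, 28, 31, 34, 35, 37, 42, 45]
group_y = [20, 22, 26, 29, 30, 33, 38, 40, 41, 48]

def back_to_back(left, right):
    """Return plot rows as (left_leaves, stem, right_leaves)."""
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    for value in left:
        left_leaves[value // 10].append(value % 10)   # stem = tens, leaf = units
    for value in right:
        right_leaves[value // 10].append(value % 10)

    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    rows = []
    for stem in range(lowest, highest + 1):
        # Left leaves increase away from the stem, so list them largest first.
        l = " ".join(str(d) for d in sorted(left_leaves[stem], reverse=True))
        r = " ".join(str(d) for d in sorted(right_leaves[stem]))
        rows.append((l, stem, r))
    return rows

rows = back_to_back(group_x, group_y)
for left, stem, right in rows:
    print(f"{left:>10} | {stem} | {right}")
```

Including every stem from the lowest to the highest, even one with no leaves, keeps gaps in the data visible in the plot.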

2. Shape of distributions

When you look at a histogram or stem-and-leaf plot, describe its shape:

Symmetric: roughly the same on both sides of the centre (mean $\approx$ median).
Positively skewed (right-skewed): a long tail to the right. Most data is on the left. Mean $>$ median.
Negatively skewed (left-skewed): a long tail to the left. Most data is on the right. Mean $<$ median.
Bimodal: two distinct peaks, suggesting two sub-groups in the data.

Worked example 2 Identifying shape

A histogram of house prices in a suburb shows many homes in the \$400,000–\$600,000 range, fewer in the \$600,000–\$800,000 range, and a small number above \$1,000,000.

The bulk of the data is on the left.
A long tail extends to the right (high-priced homes).
The distribution is positively skewed.
The mean will be pulled higher than the median by the expensive homes, so the median is a better measure of centre for this data.
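The link between skew and the mean-median gap can be illustrated with a rough Python sketch. The prices below are invented for illustration (in thousands of dollars), and both the function `describe_skew` and its `tolerance` cut-off are hypothetical choices, not a standard test. Comparing the mean with the median is only a rule of thumb: it cannot detect a bimodal shape, so always look at the graph as well.

```python
from statistics import mean, median

# Hypothetical house prices in thousands of dollars, invented for illustration.
prices = [420, 450, 480, 500, 520, 550, 580, 620, 700, 1200]

def describe_skew(data, tolerance=0.05):
    """Rough rule of thumb: compare the mean with the median.

    tolerance is an assumed cut-off: a gap smaller than this fraction
    of the median is treated as roughly symmetric.
    """
    m, med = mean(data), median(data)
    if abs(m - med) <= tolerance * abs(med):
        return "roughly symmetric"
    return "positively skewed" if m > med else "negatively skewed"

print("mean:", mean(prices), "median:", median(prices))
print(describe_skew(prices))
```

Here the single expensive home lifts the mean well above the median, which matches the description of a positively skewed distribution.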

3. Effect of outliers

An outlier is a data value that lies well outside the main body of the data.

Mean: strongly affected — one extreme value can pull the mean significantly.
Median: resistant — it depends only on the middle value(s), so one outlier barely changes it.
Range: strongly affected — it uses only the maximum and minimum.

Worked example 3 Outlier impact

Data set: $12, 14, 15, 15, 16, 17, 18, 50$.

Mean $= \dfrac{157}{8} = 19.625$. Without the outlier $50$: mean $= \dfrac{107}{7} \approx 15.3$.
Median $= \dfrac{15 + 16}{2} = 15.5$. Without $50$: median $= 15$. Barely changed.
Range $= 50 - 12 = 38$. Without $50$: range $= 18 - 12 = 6$. Dramatically reduced.

When outliers are present, the median better represents the typical value.
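A short Python sketch can confirm these figures by computing each statistic with and without the value $50$. The helper name `summarise` is our own, and the outlier is removed by hand here rather than detected automatically.

```python
from statistics import mean, median

data = [12, 14, 15, 15, 16, 17, 18, 50]
without_outlier = [x for x in data if x != 50]  # outlier chosen by inspection

def summarise(values):
    """Return a dict with the mean, median, and range of the values."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
    }

before = summarise(data)
after = summarise(without_outlier)

for name in ("mean", "median", "range"):
    change = before[name] - after[name]
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f} (change {change:.3f})")
```

The printed changes make the comparison direct: the median moves by only $0.5$, while the mean and especially the range shift much more.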

4. Sampling methods and bias

When the population is too large to survey entirely, we take a sample. The method of sampling affects the reliability of conclusions.

Method	Description	Strengths	Weaknesses
Simple random	Every member has an equal chance of selection	Avoids selection bias; tends to be representative	Needs a complete list of the population
Systematic	Select every $k$-th member from a list	Easy to implement	Can be biased if the list has a hidden cycle that lines up with $k$
Stratified	Divide into subgroups (strata), sample proportionally from each	Ensures all subgroups are represented	Requires knowledge of subgroup sizes
Convenience	Choose whoever is easiest to reach	Quick and cheap	Often biased — not representative
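The first three methods in the table can be sketched in Python with the standard `random` module. The school roll below is hypothetical (made-up student labels and year sizes), and the three function names are our own. Note one assumption in the stratified version: each stratum's share is rounded, so for awkward proportions the total may differ slightly from the target size.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of being chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random starting point, then take every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, n, rng):
    """Sample from each stratum in proportion to its size."""
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        share = round(n * len(members) / total)  # rounding is an assumption
        sample[name] = rng.sample(members, share)
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical roll: 600 Year 7 and 400 Year 8 students.
strata = {
    "Year 7": [f"Y7-{i}" for i in range(600)],
    "Year 8": [f"Y8-{i}" for i in range(400)],
}
roll = strata["Year 7"] + strata["Year 8"]

srs = simple_random_sample(roll, 50, rng)
every_20th = systematic_sample(roll, 20, rng)
by_year = stratified_sample(strata, 50, rng)

print(len(srs), len(every_20th), {year: len(s) for year, s in by_year.items()})
```

All three samples contain $50$ students, but only the stratified one guarantees the $30$ to $20$ split that mirrors the year-level sizes.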

Worked example 4 Identifying bias

A school surveys students about favourite sports by asking only those at basketball training.

This is a convenience sample: it includes only students who are already interested in basketball.
Basketball is likely to be overrepresented; other sports underrepresented.
A better approach: take a stratified random sample from each year level to capture the full school population.

5. Choosing displays and planning investigations

Different data types suit different displays:

Categorical data: bar chart, pie chart.
Numerical (discrete): dot plot, bar chart.
Numerical (continuous): histogram, stem-and-leaf plot, box plot.
Comparing two groups: back-to-back stem-and-leaf, side-by-side box plots, comparative histograms.

A well-planned statistical investigation follows these steps:

Pose a question that can be answered with data.
Plan data collection — choose sampling method, sample size, and variables.
Collect data systematically.
Analyse — calculate summary statistics, construct appropriate displays.
Conclude — interpret results, acknowledge limitations.

Practice

Fluency

Tier 1: basic skills

Classify each distribution shape: (a) tail on the right, (b) two peaks, (c) mirror-image shape, (d) tail on the left.
Data: $3, 5, 6, 7, 7, 8, 9, 40$. Find the mean, median, and range.
Remove the outlier from Q2 and recalculate mean, median, and range. Which statistic changed most?
For the data in Q2, which measure of centre better represents the typical value? Explain.
A sample is taken by selecting every 10th student on a school roll. Name this sampling method.
A survey asks 50 people at a train station about their preferred mode of transport. Explain why this sample might be biased.
Construct a stem-and-leaf plot for: $14, 18, 22, 25, 27, 31, 33, 36, 38, 42$.
What type of display would you use to compare the heights of Year 9 boys and Year 9 girls?
State whether the mean or median is higher for a positively skewed distribution.
A histogram has bars of heights $2, 5, 8, 6, 3, 1$. Describe the shape of this distribution.

Reasoning

Tier 2: mixed practice

Two classes recorded the number of books read last term. Class A: $2, 3, 3, 4, 5, 5, 6, 7, 8, 12$. Class B: $1, 2, 4, 5, 5, 6, 6, 7, 7, 8$. Construct a back-to-back stem-and-leaf plot and compare the distributions.
A data set has mean $24$ and median $18$. Is the distribution likely symmetric, positively skewed, or negatively skewed? Explain.
A researcher wants to survey $200$ out of $2000$ students about study habits. The school has $800$ Year 7, $700$ Year 8, and $500$ Year 9 students. Calculate how many students should be sampled from each year level using stratified sampling.
Explain why the median is preferred over the mean when reporting typical house prices.
A factory records the time (in seconds) to assemble a part. Morning shift: $42, 44, 45, 46, 47, 48, 50$. Afternoon shift: $43, 45, 46, 48, 50, 52, 58$. Compare using mean, median, and range.
Describe a situation where a bimodal distribution would be expected. Explain what causes the two peaks.
A student claims: “My sample of 10 friends is representative of the whole school.” Critique this claim.
State three features you should always comment on when comparing two distributions.

Reasoning

Tier 3: explain and apply

A company reports that the “average salary” is \$95,000. The CEO earns \$800,000 and the other $19$ employees earn between \$50,000 and \$70,000 each. Explain how the company’s claim could be technically true but misleading.
Design a statistical investigation to determine whether Year 9 students spend more time on homework than Year 7 students. State the question, sampling method, variables, and how you would display the results.
Two histograms have the same mean and range, but different shapes. Sketch two possible histograms and explain how this is possible.
A data set of $20$ values has mean $30$. An extra value of $80$ is added. Calculate the new mean and explain why the median might be a better summary.
Explain the difference between a population and a sample. Give an example where surveying the whole population is impractical.

Challenge

Reasoning

Harder reasoning

Two data sets each have $n$ values. Set A has mean $\bar{x}_A$ and set B has mean $\bar{x}_B$. If the two sets are combined, show that the combined mean is $\dfrac{n\bar{x}_A + n\bar{x}_B}{2n}$. What happens if the sets have different sizes $n_A$ and $n_B$?
A researcher adds a constant $c$ to every value in a data set. How does this affect (a) the mean, (b) the median, (c) the range, (d) the standard deviation? Justify each answer.
Construct a data set of $10$ values where the mean is $50$, the median is $45$, and the distribution is positively skewed. Verify your answer.
A school of $1200$ students is surveyed using stratified sampling by year level. Year 7: $350$, Year 8: $320$, Year 9: $280$, Year 10: $250$. If $120$ students are to be sampled, calculate the number from each year level. One Year 10 student in the sample scored $0$ on the test (absent). Discuss how this outlier should be handled.