What you will learn
- construct and interpret back-to-back stem-and-leaf plots and comparative histograms,
- describe the shape of a distribution: symmetric, positively skewed, negatively skewed, or bimodal,
- explain the effect of outliers on the mean, median, and range,
- identify and compare sampling methods (random, systematic, stratified, convenience) and recognise bias,
- choose appropriate data displays for different data types,
- plan and conduct a statistical investigation.
Two classes sit the same maths test. Class A scores: . Class B scores: .
- Class A mean . Class B mean .
- Class A median . Class B median .
- Class A range . Class B range .
- The means are nearly equal, but Class B has much greater spread and potential outliers ( and ).
Key idea: similar centres can mask very different spreads — always look at both.
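The key idea can be checked numerically. The sketch below uses two invented score lists (hypothetical data, not the class scores from the example above) that share the same mean and median but have very different ranges.

```python
from statistics import mean, median


def summarise(scores):
    """Return the mean, median and range of a list of numbers."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "range": max(scores) - min(scores),
    }


# Hypothetical test scores, invented for illustration only.
class_a = [68, 70, 71, 72, 73, 74, 76]
class_b = [45, 60, 70, 72, 78, 85, 94]

summary_a = summarise(class_a)
summary_b = summarise(class_b)
print("Class A:", summary_a)
print("Class B:", summary_b)
```

Both lists are centred on 72, yet the second has a range about six times as large, which is why a measure of spread should always be reported alongside a measure of centre.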
1. Back-to-back stem-and-leaf plots
A back-to-back stem-and-leaf plot displays two data sets sharing a common stem. The leaves for one group extend to the left, and the leaves for the other extend to the right.
Reading the plot: the fitness group’s pulse rates cluster in the 60s, while the control group’s data is more spread out and shifted higher.
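Group X times (seconds), read from the plot below: 23, 25, 28, 31, 34, 35, 37, 42, 45. Group Y times (seconds): 20, 22, 26, 29, 30, 33, 38, 40, 41, 48.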
- Stems are the tens digits: 2, 3, 4.
- Write Group X leaves to the left so that they increase outward from the stem (read left to right, they appear in descending order), and Group Y leaves to the right in ascending order.
| Group X | Stem | Group Y |
|---|---|---|
| 8 5 3 | 2 | 0 2 6 9 |
| 7 5 4 1 | 3 | 0 3 8 |
| 5 2 | 4 | 0 1 8 |
- Group X median = 34 s (the 5th of 9 values). Group Y median = 31.5 s (the mean of the 5th and 6th of 10 values, 30 and 33). Group X is slightly slower on average.
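The construction steps can be sketched in code. This is a minimal illustration rather than a standard library routine: the function names `leaves_by_stem` and `back_to_back` are made up for this example, and the data are the values read off the plot above.

```python
from collections import defaultdict
from statistics import median


def leaves_by_stem(values):
    """Map each tens digit (stem) to its sorted units digits (leaves)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return groups


def back_to_back(left, right):
    """Return the rows of a back-to-back stem-and-leaf plot as strings."""
    left_groups, right_groups = leaves_by_stem(left), leaves_by_stem(right)
    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    rows = []
    for stem in range(lowest, highest + 1):
        # Left leaves are reversed so they increase outward from the stem.
        left_leaves = " ".join(str(d) for d in reversed(left_groups[stem]))
        right_leaves = " ".join(str(d) for d in right_groups[stem])
        rows.append(f"{left_leaves:>10} | {stem} | {right_leaves}")
    return rows


# Values read from the Group X / Group Y plot.
group_x = [23, 25, 28, 31, 34, 35, 37, 42, 45]
group_y = [20, 22, 26, 29, 30, 33, 38, 40, 41, 48]

rows = back_to_back(group_x, group_y)
for row in rows:
    print(row)
print("Median X:", median(group_x), "Median Y:", median(group_y))
```

The sketch assumes two-digit whole numbers, so the stem is the tens digit and the leaf is the units digit. Other data would need a different split.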
2. Shape of distributions
When you look at a histogram or stem-and-leaf plot, describe its shape:
- Symmetric: roughly the same on both sides of the centre (mean ≈ median).
- Positively skewed (right-skewed): a long tail to the right. Most data is on the left. Mean > median.
- Negatively skewed (left-skewed): a long tail to the left. Most data is on the right. Mean < median.
- Bimodal: two distinct peaks, suggesting two sub-groups in the data.
A histogram of house prices in a suburb shows many homes in the $400,000–$600,000 range, fewer in the $600,000–$800,000 range, and a small number above $1,000,000.
- The bulk of the data is on the left.
- A long tail extends to the right (high-priced homes).
- The distribution is positively skewed.
- The mean will be pulled higher than the median by the expensive homes, so the median is a better measure of centre for this data.
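The link between skew and the gap between mean and median can be turned into a rough check. The sketch below is a rule of thumb only, not a formal test: the function name `describe_skew`, the 5% tolerance, and the house prices (in thousands of dollars) are all assumptions made for illustration. It cannot detect a bimodal shape, which still needs a histogram.

```python
from statistics import mean, median


def describe_skew(values, tolerance=0.05):
    """Guess the shape from the gap between mean and median.

    The gap is compared with a fraction (tolerance) of the range,
    an arbitrary threshold chosen for this illustration.
    """
    gap = mean(values) - median(values)
    spread = max(values) - min(values)
    if abs(gap) <= tolerance * spread:
        return "roughly symmetric"
    return "positively skewed" if gap > 0 else "negatively skewed"


# Hypothetical house prices in thousands of dollars, invented for illustration.
prices = [420, 450, 480, 510, 530, 560, 590, 650, 720, 1250]

print("Mean:", mean(prices), "Median:", median(prices))
print("Shape:", describe_skew(prices))
```

Here the single expensive home lifts the mean well above the median, so the check reports a positive skew, matching the reasoning in the example.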
3. Effect of outliers
An outlier is a data value that lies well outside the main body of the data.
- Mean: strongly affected. Every value contributes to the sum, so one extreme value pulls the mean toward it.
- Median: resistant. It depends only on the middle value(s), so one outlier barely changes it.
- Range: strongly affected. It is the maximum minus the minimum, so an outlier at either end changes it directly.
Data set: .
- Mean . Without the outlier : mean .
- Median . Without : median . Barely changed.
- Range . Without : range . Dramatically reduced.
When outliers are present, the median better represents the typical value.
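A short sketch makes the comparison concrete. The data set is invented for illustration (it is not the data set from the example above), and the helper name `summary` is hypothetical. Each statistic is computed with and without the single extreme value.

```python
from statistics import mean, median


def summary(data):
    """Return mean, median and range for a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
    }


# Hypothetical data with one obvious outlier (60), invented for illustration.
data = [12, 14, 15, 15, 16, 17, 18, 60]
outlier = 60

with_outlier = summary(data)
without_outlier = summary([x for x in data if x != outlier])

for name in ("mean", "median", "range"):
    change = with_outlier[name] - without_outlier[name]
    print(f"{name}: {with_outlier[name]:.2f} -> {without_outlier[name]:.2f} "
          f"(change {change:.2f})")
```

With these values the median moves by only half a unit, while the mean and especially the range shift much more.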
4. Sampling methods and bias
When the population is too large to survey entirely, we take a sample. The method of sampling affects the reliability of conclusions.
| Method | Description | Strengths | Weaknesses |
|---|---|---|---|
| Simple random | Every member has an equal chance of selection | Unbiased; tends to be representative when the sample is large enough | Needs a complete list of the population |
| Systematic | Select every k-th member from a list, starting from a randomly chosen position | Easy to implement | Can give a distorted sample if the list has a hidden cycle that matches k |
| Stratified | Divide into subgroups (strata), sample proportionally from each | Ensures all subgroups are represented | Requires knowledge of subgroup sizes |
| Convenience | Choose whoever is easiest to reach | Quick and cheap | Often biased — not representative |
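Three of the methods in the table can be sketched with Python's standard `random` module. The population is a hypothetical roll of 120 student ID numbers, and the function names are made up for this illustration. The seed is fixed only so that the run is repeatable.

```python
import random


def simple_random_sample(population, n, rng):
    """Every member has an equal chance of being chosen."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick a random start, then take every k-th member of the list."""
    k = len(population) // n
    start = rng.randrange(k)
    return population[start::k][:n]


def convenience_sample(population, n):
    """Take whoever comes first: quick, but often biased."""
    return population[:n]


# Hypothetical school roll: student IDs 1 to 120.
population = list(range(1, 121))
rng = random.Random(0)  # fixed seed so the example is repeatable

srs = simple_random_sample(population, 12, rng)
systematic = systematic_sample(population, 12, rng)
convenience = convenience_sample(population, 12)

print("Simple random:", sorted(srs))
print("Systematic:   ", systematic)
print("Convenience:  ", convenience)
```

The convenience sample always returns the first 12 IDs, which shows how it can miss most of the population. Stratified sampling needs subgroup information, so it is not covered in this sketch.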
A school surveys students about favourite sports by asking only those at basketball training.
- This is a convenience sample: it includes only students who are already interested in basketball.
- Basketball is likely to be overrepresented; other sports underrepresented.
- A better approach: take a stratified random sample from each year level to capture the full school population.
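A stratified sample can be sketched as a two-step process: allocate the sample across year levels in proportion to their sizes, then draw a simple random sample within each level. The year-level sizes, the sample size of 30, and the function names below are assumptions made for illustration. Rounding is handled by giving any leftover places to the strata with the largest remainders, which is one common choice among several.

```python
import random


def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    allocation, remainders = {}, {}
    for name, size in strata_sizes.items():
        # Whole-number share and the leftover fraction (as an integer remainder).
        allocation[name], remainders[name] = divmod(sample_size * size, total)
    shortfall = sample_size - sum(allocation.values())
    # Hand out any remaining places to the largest remainders first.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:shortfall]:
        allocation[name] += 1
    return allocation


def stratified_sample(strata, sample_size, rng):
    """Draw a simple random sample of the allocated size from each stratum."""
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(strata[name], allocation[name]) for name in strata}


# Hypothetical year-level sizes, invented for illustration.
year_sizes = {"Year 7": 150, "Year 8": 120, "Year 9": 90, "Year 10": 90}
strata = {
    year: [f"{year}-{i}" for i in range(1, size + 1)]
    for year, size in year_sizes.items()
}

allocation = proportional_allocation(year_sizes, 30)
sample = stratified_sample(strata, 30, random.Random(1))
print(allocation)
```

With these sizes each year level's share works out to a whole number (10, 8, 6 and 6), so no rounding adjustment is needed. The remainder step only matters when the shares are fractional.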
5. Choosing displays and planning investigations
Different data types suit different displays:
- Categorical data: bar chart, pie chart.
- Numerical (discrete): dot plot, bar chart.
- Numerical (continuous): histogram, stem-and-leaf plot, box plot.
- Comparing two groups: back-to-back stem-and-leaf, side-by-side box plots, comparative histograms.
A well-planned statistical investigation follows these steps:
- Pose a question that can be answered with data.
- Plan data collection — choose sampling method, sample size, and variables.
- Collect data systematically.
- Analyse — calculate summary statistics, construct appropriate displays.
- Conclude — interpret results, acknowledge limitations.
Practice
Tier 1: basic skills
- Classify each distribution shape: (a) tail on the right, (b) two peaks, (c) mirror-image shape, (d) tail on the left.
- Data: . Find the mean, median, and range.
- Remove the outlier from Q2 and recalculate mean, median, and range. Which statistic changed most?
- For the data in Q2, which measure of centre better represents the typical value? Explain.
- A sample is taken by selecting every 10th student on a school roll. Name this sampling method.
- A survey asks 50 people at a train station about their preferred mode of transport. Explain why this sample might be biased.
- Construct a stem-and-leaf plot for: .
- What type of display would you use to compare the heights of Year 9 boys and Year 9 girls?
- State whether the mean or median is higher for a positively skewed distribution.
- A histogram has bars of heights . Describe the shape of this distribution.
Tier 2: mixed practice
- Two classes recorded the number of books read last term. Class A: . Class B: . Construct a back-to-back stem-and-leaf plot and compare the distributions.
- A data set has mean and median . Is the distribution likely symmetric, positively skewed, or negatively skewed? Explain.
- A researcher wants to survey out of students about study habits. The school has Year 7, Year 8, and Year 9 students. Calculate how many students should be sampled from each year level using stratified sampling.
- Explain why the median is preferred over the mean when reporting typical house prices.
- A factory records the time (in seconds) to assemble a part. Morning shift: . Afternoon shift: . Compare using mean, median, and range.
- Describe a situation where a bimodal distribution would be expected. Explain what causes the two peaks.
- A student claims: “My sample of 10 friends is representative of the whole school.” Critique this claim.
- State three features you should always comment on when comparing two distributions.
Tier 3: explain and apply
- A company reports that the “average salary” is $95,000. The CEO earns $800,000 and the other employees earn between $50,000 and $70,000 each. Explain how the company’s claim could be technically true but misleading.
- Design a statistical investigation to determine whether Year 9 students spend more time on homework than Year 7 students. State the question, sampling method, variables, and how you would display the results.
- Two histograms have the same mean and range, but different shapes. Sketch two possible histograms and explain how this is possible.
- A data set of values has mean . An extra value of is added. Calculate the new mean and explain why the median might be a better summary.
- Explain the difference between a population and a sample. Give an example where surveying the whole population is impractical.
Challenge
Harder reasoning
- Two data sets each have values. Set A has mean and set B has mean . If the two sets are combined, show that the combined mean is . What happens if the sets have different sizes and ?
- A researcher adds a constant to every value in a data set. How does this affect (a) the mean, (b) the median, (c) the range, (d) the standard deviation? Justify each answer.
- Construct a data set of values where the mean is , the median is , and the distribution is positively skewed. Verify your answer.
- A school of students is surveyed using stratified sampling by year level. Year 7: , Year 8: , Year 9: , Year 10: . If students are to be sampled, calculate the number from each year level. One Year 10 student in the sample scored on the test (absent). Discuss how this outlier should be handled.