Topic 14 | Statistics & Probability

Data analysis and distributions

Year 9 core: comparing data distributions using back-to-back stem-and-leaf plots and histograms, describing shape and spread, understanding sampling methods and bias, and planning statistical investigations.

How to use this page

Read the explanation, work through the examples, then complete the core practice before printing.

Worked example 0 Real-world example: comparing test scores

Two classes sit the same maths test. Class A scores: 52, 58, 61, 63, 65, 67, 68, 70, 72, 74. Class B scores: 40, 55, 60, 62, 64, 66, 68, 70, 85, 90.

  1. Class A mean \(= \dfrac{650}{10} = 65\). Class B mean \(= \dfrac{660}{10} = 66\).
  2. Class A median \(= \dfrac{65 + 67}{2} = 66\). Class B median \(= \dfrac{64 + 66}{2} = 65\).
  3. Class A range \(= 74 - 52 = 22\). Class B range \(= 90 - 40 = 50\).
  4. The means are nearly equal, but Class B has much greater spread and potential outliers (40 and 90).

Key idea: similar centres can mask very different spreads — always look at both.
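
The calculations in Worked example 0 can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the helper name `summarise` is our own and is not part of the lesson.

```python
from statistics import mean, median

# Scores copied from Worked example 0.
class_a = [52, 58, 61, 63, 65, 67, 68, 70, 72, 74]
class_b = [40, 55, 60, 62, 64, 66, 68, 70, 85, 90]

def summarise(scores):
    """Return (mean, median, range) for a list of numbers."""
    return mean(scores), median(scores), max(scores) - min(scores)

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    centre_mean, centre_median, spread = summarise(scores)
    print(f"{name}: mean={centre_mean}, median={centre_median}, range={spread}")
```

Reporting the range next to the mean and median makes the difference in spread visible even though the two centres are nearly equal.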

1. Back-to-back stem-and-leaf plots

A back-to-back stem-and-leaf plot displays two data sets sharing a common stem. The leaves for one group extend to the left, and the leaves for the other extend to the right.

Fitness group (leaf) | Stem | Control group (leaf)
               8 6 2 |  5   | 4 6 8 8
           8 6 4 2 0 |  6   | 2 5 6 8
             6 4 2 0 |  7   | 0 2 4 6 8
                   2 |  8   | 0 2 4 6 8
                     |  9   | 2 4
Key: 2 | 6 | 5 means 62 (fitness) and 65 (control)
Back-to-back stem-and-leaf plot comparing pulse rates (beats per minute) for a fitness group and a control group.

Reading the plot: the fitness group’s pulse rates cluster in the 60s, while the control group’s data is more spread out and shifted higher.

Worked example 1 Constructing a back-to-back stem-and-leaf plot

Group X times (seconds): 23, 25, 28, 31, 34, 35, 37, 42, 45. Group Y times: 20, 22, 26, 29, 30, 33, 38, 40, 41, 48.

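  1. Stems are the tens digits: 2, 3, 4.
  2. Write Group X leaves to the left of the stem and Group Y leaves to the right. On both sides the leaves increase moving away from the stem, so Group X's leaves read in descending order from left to right.

Group X | Stem | Group Y
  8 5 3 |  2   | 0 2 6 9
7 5 4 1 |  3   | 0 3 8
    5 2 |  4   | 0 1 8

  3. Group X median \(= 34\) (the 5th of 9 values). Group Y median \(= \dfrac{30 + 33}{2} = 31.5\) (the average of the 5th and 6th of 10 values). Group X's typical time is slightly longer, so Group X is slightly slower.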
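
The construction steps in Worked example 1 can be sketched in code. This is only an illustration for whole-number data: the function name `back_to_back` is our own, and the rows are not padded, so the stems will not line up the way they do in a hand-drawn plot.

```python
from collections import defaultdict

def back_to_back(left, right):
    """Return the rows of a back-to-back stem-and-leaf plot (stem = tens, leaf = units)."""
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    for value in left:
        left_leaves[value // 10].append(value % 10)
    for value in right:
        right_leaves[value // 10].append(value % 10)

    lowest = min(left + right) // 10
    highest = max(left + right) // 10
    rows = []
    for stem in range(lowest, highest + 1):  # empty stems are kept so gaps stay visible
        # Left leaves increase away from the stem, so they are written largest first.
        left_text = " ".join(str(leaf) for leaf in sorted(left_leaves[stem], reverse=True))
        right_text = " ".join(str(leaf) for leaf in sorted(right_leaves[stem]))
        rows.append(f"{left_text} | {stem} | {right_text}")
    return rows

group_x = [23, 25, 28, 31, 34, 35, 37, 42, 45]
group_y = [20, 22, 26, 29, 30, 33, 38, 40, 41, 48]
for row in back_to_back(group_x, group_y):
    print(row)
```

Sorting the left-hand leaves in reverse is what makes them increase outward from the stem, matching the convention used in the worked example.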

2. Shape of distributions

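When you look at a histogram or stem-and-leaf plot, describe its shape: symmetric (a mirror-image shape), positively skewed (tail on the right), negatively skewed (tail on the left), or bimodal (two peaks).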

Worked example 2 Identifying shape

A histogram of house prices in a suburb shows many homes in the $400,000–$600,000 range, fewer in the $600,000–$800,000 range, and a small number above $1,000,000.

  1. The bulk of the data is on the left.
  2. A long tail extends to the right (high-priced homes).
  3. The distribution is positively skewed.
  4. The mean will be pulled higher than the median by the expensive homes, so the median is a better measure of centre for this data.

3. Effect of outliers

An outlier is a data value that lies well outside the main body of the data.

Worked example 3 Outlier impact

Data set: 12, 14, 15, 15, 16, 17, 18, 50.

  1. Mean \(= \dfrac{157}{8} = 19.625\). Without the outlier 50: mean \(= \dfrac{107}{7} \approx 15.3\).
  2. Median \(= \dfrac{15 + 16}{2} = 15.5\). Without 50: median \(= 15\). Barely changed.
  3. Range \(= 50 - 12 = 38\). Without 50: range \(= 18 - 12 = 6\). Dramatically reduced.

When outliers are present, the median better represents the typical value.
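
The comparison in Worked example 3 can be reproduced with a few lines of code. This is a rough sketch using the standard `statistics` module; the helper `summary` and the variable names are our own.

```python
from statistics import mean, median

data = [12, 14, 15, 15, 16, 17, 18, 50]   # data set from Worked example 3
trimmed = [x for x in data if x != 50]    # same data with the outlier removed

def summary(values):
    """Collect the three statistics compared in the worked example."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
    }

before, after = summary(data), summary(trimmed)
for stat in before:
    change = after[stat] - before[stat]
    print(f"{stat}: {before[stat]:.3f} -> {after[stat]:.3f} (change {change:+.3f})")
```

Listing the change for each statistic shows which measures are sensitive to the single extreme value and which are not.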

4. Sampling methods and bias

When the population is too large to survey entirely, we take a sample. The method of sampling affects the reliability of conclusions.

Method | Description | Strengths | Weaknesses
Simple random | Every member has an equal chance of selection | Unbiased, representative | Needs a complete list of the population
Systematic | Select every k-th member from a list | Easy to implement | Can be biased if the list has a hidden cycle that matches k
Stratified | Divide into subgroups (strata), sample proportionally from each | Ensures all subgroups are represented | Requires knowledge of subgroup sizes
Convenience | Choose whoever is easiest to reach | Quick and cheap | Often biased and not representative

Worked example 4 Identifying bias

A school surveys students about favourite sports by asking only those at basketball training.

  1. This is a convenience sample: it selects only students who are already interested in basketball.
  2. Basketball is likely to be overrepresented; other sports underrepresented.
  3. A better approach: take a stratified random sample from each year level to capture the full school population.
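
The three probability-based methods from this section can be sketched as follows. Everything here is illustrative: the roll of 300 student IDs split evenly across three year levels is hypothetical, the function names are our own, and the fixed seed is only there so the output is repeatable.

```python
import random

def simple_random_sample(population, size, rng):
    """Every member has an equal chance of selection."""
    return rng.sample(population, size)

def systematic_sample(population, k, rng):
    """Pick a random start among the first k members, then take every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, total_size, rng):
    """Sample from each stratum in proportion to its share of the population."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding can make the total differ slightly from total_size in general.
        share = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, share)
    return sample

rng = random.Random(14)  # fixed seed (arbitrary choice) so the sketch is repeatable

# Hypothetical roll: 300 student IDs, 100 in each of three year levels.
strata = {
    "Year 7": list(range(0, 100)),
    "Year 8": list(range(100, 200)),
    "Year 9": list(range(200, 300)),
}
roll = [student for members in strata.values() for student in members]

print(len(simple_random_sample(roll, 30, rng)))
print(len(systematic_sample(roll, 10, rng)))
print({name: len(chosen) for name, chosen in stratified_sample(strata, 30, rng).items()})
```

A convenience sample has no counterpart here because it involves no random selection at all, which is exactly why it is prone to bias.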

5. Choosing displays and planning investigations

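Different data types suit different displays. For example, a back-to-back stem-and-leaf plot or side-by-side box plots work well for comparing two numerical distributions.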

A well-planned statistical investigation follows these steps:

  1. Pose a question that can be answered with data.
  2. Plan data collection — choose sampling method, sample size, and variables.
  3. Collect data systematically.
  4. Analyse — calculate summary statistics, construct appropriate displays.
  5. Conclude — interpret results, acknowledge limitations.

Practice

Fluency

Tier 1: basic skills

    1. Classify each distribution shape: (a) tail on the right, (b) two peaks, (c) mirror-image shape, (d) tail on the left.
    2. Data: 3, 5, 6, 7, 7, 8, 9, 40. Find the mean, median, and range.
    3. Remove the outlier from Q2 and recalculate mean, median, and range. Which statistic changed most?
    4. For the data in Q2, which measure of centre better represents the typical value? Explain.
    5. A sample is taken by selecting every 10th student on a school roll. Name this sampling method.
    6. A survey asks 50 people at a train station about their preferred mode of transport. Explain why this sample might be biased.
    7. Construct a stem-and-leaf plot for: 14, 18, 22, 25, 27, 31, 33, 36, 38, 42.
    8. What type of display would you use to compare the heights of Year 9 boys and Year 9 girls?
    9. State whether the mean or median is higher for a positively skewed distribution.
    10. A histogram has bars of heights 2, 5, 8, 6, 3, 1. Describe the shape of this distribution.
Reasoning

Tier 2: mixed practice

    1. Two classes recorded the number of books read last term. Class A: 2, 3, 3, 4, 5, 5, 6, 7, 8, 12. Class B: 1, 2, 4, 5, 5, 6, 6, 7, 7, 8. Construct a back-to-back stem-and-leaf plot and compare the distributions.
    2. A data set has mean 24 and median 18. Is the distribution likely symmetric, positively skewed, or negatively skewed? Explain.
    3. A researcher wants to survey 200 out of 2000 students about study habits. The school has 800 Year 7, 700 Year 8, and 500 Year 9 students. Calculate how many students should be sampled from each year level using stratified sampling.
    4. Explain why the median is preferred over the mean when reporting typical house prices.
    5. A factory records the time (in seconds) to assemble a part. Morning shift: 42, 44, 45, 46, 47, 48, 50. Afternoon shift: 43, 45, 46, 48, 50, 52, 58. Compare using mean, median, and range.
    6. Describe a situation where a bimodal distribution would be expected. Explain what causes the two peaks.
    7. A student claims: “My sample of 10 friends is representative of the whole school.” Critique this claim.
    8. State three features you should always comment on when comparing two distributions.
Reasoning

Tier 3: explain and apply

    1. A company reports that the “average salary” is $95,000. The CEO earns $800,000 and the other 19 employees earn between $50,000 and $70,000 each. Explain how the company’s claim could be technically true but misleading.
    2. Design a statistical investigation to determine whether Year 9 students spend more time on homework than Year 7 students. State the question, sampling method, variables, and how you would display the results.
    3. Two histograms have the same mean and range, but different shapes. Sketch two possible histograms and explain how this is possible.
    4. A data set of 20 values has mean 30. An extra value of 80 is added. Calculate the new mean and explain why the median might be a better summary.
    5. Explain the difference between a population and a sample. Give an example where surveying the whole population is impractical.

Challenge

Reasoning

Harder reasoning

    1. Two data sets each have \(n\) values. Set A has mean \(\bar{x}_A\) and set B has mean \(\bar{x}_B\). If the two sets are combined, show that the combined mean is \(\dfrac{n\bar{x}_A + n\bar{x}_B}{2n}\). What happens if the sets have different sizes \(n_A\) and \(n_B\)?
    2. A researcher adds a constant \(c\) to every value in a data set. How does this affect (a) the mean, (b) the median, (c) the range, (d) the standard deviation? Justify each answer.
    3. Construct a data set of 10 values where the mean is 50, the median is 45, and the distribution is positively skewed. Verify your answer.
    4. A school of 1200 students is surveyed using stratified sampling by year level. Year 7: 350, Year 8: 320, Year 9: 280, Year 10: 250. If 120 students are to be sampled, calculate the number from each year level. One Year 10 student in the sample scored 0 on the test (absent). Discuss how this outlier should be handled.
Answers

Answer key

Attempt the practice first. When you're ready to check, expand the answers below.

Tier 1

    1. (a) positively skewed, (b) bimodal, (c) symmetric, (d) negatively skewed.
    2. Mean \(= \dfrac{85}{8} = 10.625\). Median \(= \dfrac{7 + 7}{2} = 7\). Range \(= 40 - 3 = 37\).
    3. Without 40: mean \(= \dfrac{45}{7} \approx 6.43\), median \(= 7\), range \(= 9 - 3 = 6\). The range changed most (from 37 to 6), followed by the mean (from 10.625 to 6.43). The median did not change.
    4. The median (7) better represents the typical value because the outlier (40) inflates the mean.
    5. Systematic sampling.
    6. People at a train station are more likely to prefer trains, so train travel would be overrepresented. People who drive, cycle, or walk are less likely to be at the station.
    7. Stem | Leaf: 1 | 4 8, 2 | 2 5 7, 3 | 1 3 6 8, 4 | 2.
    8. A back-to-back stem-and-leaf plot or side-by-side box plots would both work well for comparing two numerical distributions.
    9. The mean is higher than the median in a positively skewed distribution (the tail of high values pulls the mean up).
    10. The bars rise then fall with a single peak, so the distribution is approximately symmetric (or very slightly positively skewed if the tail on the right is longer).

Tier 2

    1. Back-to-back stem-and-leaf: Stem 0: Class A leaves 8 7 6 5 5 4 3 3 2 | Class B leaves 1 2 4 5 5 6 6 7 7 8. Stem 1: Class A leaf 2 | Class B (none). Class A has a wider spread (range 10 vs 7) with an outlier at 12. Class B is more tightly clustered. Medians are similar (A: 5, B: 5.5).
    2. Positively skewed. The mean (24) is greater than the median (18), which indicates a tail of high values pulling the mean up.
    3. Total \(= 2000\). Proportions: Year 7 \(= \dfrac{800}{2000} \times 200 = 80\), Year 8 \(= \dfrac{700}{2000} \times 200 = 70\), Year 9 \(= \dfrac{500}{2000} \times 200 = 50\).
    4. House prices are typically positively skewed (a few very expensive houses push the mean up). The median gives a better sense of what a “typical” house costs because it is not affected by the extreme values.
    5. Morning: mean \(= 46\), median \(= 46\), range \(= 8\). Afternoon: mean \(\approx 48.9\), median \(= 48\), range \(= 15\). The afternoon shift is slightly slower on average and has more variation, possibly due to the outlier at 58.
    6. Example: heights of a mixed group of adult men and women. The two peaks correspond to the average female height and the average male height: two overlapping subpopulations create bimodality.
    7. A sample of 10 friends is a convenience sample that is not random. Friends tend to share interests, backgrounds, and demographics, so the sample is likely biased and not representative of the whole school. A random or stratified sample would be more reliable.
    8. When comparing two distributions, comment on: (i) centre (mean or median), (ii) spread (range or IQR), and (iii) shape (symmetric, skewed, or bimodal). Also note any outliers.
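
The proportional allocation in answer 3 can be checked with a small helper. This is a sketch only: the function name `proportional_allocation` is our own, and in general the results may not be whole numbers, so a real study would need a rounding rule.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Share a sample across strata in proportion to each stratum's size."""
    population = sum(stratum_sizes.values())
    return {
        name: sample_size * size / population
        for name, size in stratum_sizes.items()
    }

# Year-level sizes from Tier 2, question 3.
year_levels = {"Year 7": 800, "Year 8": 700, "Year 9": 500}
allocation = proportional_allocation(year_levels, 200)
print(allocation)
```

Each stratum's sample size is its fraction of the population multiplied by the total sample size, so the allocations always add up to the requested total before any rounding.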

Tier 3

    1. The CEO’s salary of $800,000 pulls the mean up. If the other 19 earn an average of $60,000, the total is \(19 \times 60\,000 + 800\,000 = 1\,940\,000\), giving mean \(= \dfrac{1\,940\,000}{20} = 97\,000\) dollars. The “average” (mean) is close to $95,000 (it is exactly $95,000 if the other 19 average about $57,900), but the median is around $60,000. Most employees earn far less than the reported average. The company uses the mean to create a misleading impression.
    2. Question: “Do Year 9 students spend more time per week on homework than Year 7 students?” Sampling: stratified random sample of 30 students from each year level. Variables: year level (categorical), homework hours per week (continuous). Display: side-by-side box plots or back-to-back stem-and-leaf plot. Calculate mean and median for each group and compare.
    3. Example: Histogram 1 is symmetric (bell-shaped). Histogram 2 is bimodal with one peak below the mean and one above. Both can have the same mean (balanced around the centre) and the same range (same min and max) but very different shapes. The bimodal histogram has more data at the extremes and less near the centre.
    4. Original sum \(= 20 \times 30 = 600\). New sum \(= 600 + 80 = 680\). New mean \(= \dfrac{680}{21} \approx 32.4\). The mean increased by about 2.4. The median changes from the average of the 10th and 11th values to the 11th of the 21 ordered values, so it stays between the old 10th and 11th values and typically changes very little. This makes it more stable and representative.
    5. A population is the entire group of interest; a sample is a subset selected for study. Example: surveying every one of Australia’s approximately 26 million residents about exercise habits is impractical due to cost and time. A representative sample of a few thousand provides useful estimates instead.
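
The arithmetic in answers 1 and 4 can be verified in code. This sketch assumes, as answer 1 does, that the 19 other employees each earn $60,000; the helper `mean_after_adding` is our own name.

```python
from statistics import mean, median

# Answer 1: 19 staff on an assumed $60,000 each, plus the CEO on $800,000.
salaries = [60_000] * 19 + [800_000]
print(mean(salaries), median(salaries))

# Answer 4: 20 values with mean 30, then one extra value of 80.
def mean_after_adding(old_mean, count, new_value):
    """Rebuild the total from the old mean, add the new value, and re-average."""
    return (old_mean * count + new_value) / (count + 1)

print(round(mean_after_adding(30, 20, 80), 1))
```

In both cases one large value moves the mean noticeably, while the median of the salary list stays at the typical employee's pay.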

Challenge

    1. Combined sum \(= n\bar{x}_A + n\bar{x}_B\). Combined count \(= 2n\). Combined mean \(= \dfrac{n\bar{x}_A + n\bar{x}_B}{2n} = \dfrac{\bar{x}_A + \bar{x}_B}{2}\). For different sizes: combined mean \(= \dfrac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}\), which is a weighted average of the two means.
    2. (a) Mean increases by \(c\) (every value increases by \(c\), so the sum increases by \(nc\), and the mean by \(c\)). (b) Median increases by \(c\) (the middle value shifts by \(c\)). (c) Range is unchanged (max and min both increase by \(c\), so their difference is the same). (d) Standard deviation is unchanged (each value and the mean both shift by \(c\), so the deviations from the mean are the same).
    3. One possible set: 30, 35, 38, 40, 44, 46, 50, 55, 62, 100. Sum \(= 500\), mean \(= 50\). Median \(= \dfrac{44 + 46}{2} = 45\). The high value 100 creates a right tail, giving positive skew. Mean \(>\) median, which is consistent with positive skew.
    4. Proportions: Year 7 \(= \dfrac{350}{1200} \times 120 = 35\), Year 8 \(= \dfrac{320}{1200} \times 120 = 32\), Year 9 \(= \dfrac{280}{1200} \times 120 = 28\), Year 10 \(= \dfrac{250}{1200} \times 120 = 25\). The student who scored 0 was absent, not genuinely scoring zero. This value should be treated as missing data and excluded from analysis (or the student should be resurveyed). Including it would unfairly lower Year 10’s statistics and misrepresent that year level’s performance.
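
Answers 1 and 2 can be spot-checked numerically. The sketch below reuses Groups X and Y from Worked example 1 (sizes 9 and 10) for the weighted mean and the Worked example 3 data for the shift; the constant 10 is an arbitrary choice, `combined_mean` is our own helper, and `pstdev` is the population standard deviation, which is one of two common definitions.

```python
import math
from statistics import mean, median, pstdev

def combined_mean(mean_a, n_a, mean_b, n_b):
    """Weighted average of two group means, weighted by group size."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)

# Groups of different sizes, taken from Worked example 1.
group_x = [23, 25, 28, 31, 34, 35, 37, 42, 45]
group_y = [20, 22, 26, 29, 30, 33, 38, 40, 41, 48]
formula = combined_mean(mean(group_x), len(group_x), mean(group_y), len(group_y))
direct = mean(group_x + group_y)
print(formula, direct)

# Adding a constant c = 10 to every value of the Worked example 3 data.
data = [12, 14, 15, 15, 16, 17, 18, 50]
shifted = [x + 10 for x in data]
print(mean(shifted) - mean(data), median(shifted) - median(data))
print(max(shifted) - min(shifted) == max(data) - min(data))
print(math.isclose(pstdev(shifted), pstdev(data)))
```

A numerical check like this does not replace the algebraic justification asked for in the question, but it is a quick way to catch a wrong formula.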
