Topic 15 | Statistics & Probability

Scatterplots and bivariate data

Year 10 core: bivariate data, constructing scatterplots, describing association (strength, direction, form), line of good fit by eye, interpolation vs extrapolation, and correlation vs causation.

How to use this page

Read the explanation, work through the examples, then complete the core practice before printing.

Worked example 0 Real-world example: study time and marks

A teacher records the hours studied and test marks (out of 50) for 8 students:

Hours | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Mark | 15 | 20 | 22 | 28 | 30 | 35 | 38 | 42
  1. Plot each pair as a point: \((1, 15), (2, 20), \ldots, (8, 42)\).
  2. The points rise from left to right: positive association.
  3. They lie close to a straight line: strong, linear association.
  4. A line of good fit passes through approximately \((2, 20)\) and \((7, 38)\).
  5. Gradient \(= \dfrac{38 - 20}{7 - 2} = \dfrac{18}{5} = 3.6\). Equation: \(y - 20 = 3.6(x - 2)\), so \(y = 3.6x + 12.8\).

Key idea: we can use the line to predict marks for a given number of hours — but only within the data range.

1. What is bivariate data?

Bivariate data consists of pairs of measurements on two variables for each individual or observation. The explanatory variable (independent) is plotted on the horizontal axis, and the response variable (dependent) is plotted on the vertical axis.

[Figure: scatterplot of Mark (0 to 50) against Hours studied (0 to 8), showing a positive linear association and a line of good fit.]

2. Describing association

When describing a scatterplot, address three features:

Feature | Options
Direction | Positive (upward trend) or negative (downward trend)
Strength | Strong (points close to a line), moderate, or weak (points widely scattered)
Form | Linear (straight-line pattern) or non-linear (curved pattern)

If there is no discernible pattern, we say there is no association.

Worked example 1 Describing a scatterplot

A scatterplot of car age (years) vs resale value ($) shows points falling from left to right. The points cluster tightly around a curve that flattens out for older cars.

Description: there is a strong, negative, non-linear association between car age and resale value. As car age increases, resale value decreases, but the rate of decrease slows for older cars.

3. Line of good fit by eye

A line of good fit (or trend line) is a straight line drawn through the data that best represents the overall pattern. When drawing by eye:

  1. The line should pass through (or close to) the middle of the data cloud.
  2. Roughly equal numbers of points should be above and below the line.
  3. The line should follow the direction and slope of the data.
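
The second guideline above can be checked numerically once a candidate line has been chosen. The sketch below is only an illustration: it reuses the data and the line \(y = 3.6x + 12.8\) from Worked example 0, and the names `points` and `count_sides` are hypothetical, not part of the lesson.

```python
# Data from Worked example 0: (hours studied, mark out of 50).
points = [(1, 15), (2, 20), (3, 22), (4, 28), (5, 30), (6, 35), (7, 38), (8, 42)]


def count_sides(points, m, c):
    """Count how many points lie above, on, and below the line y = m*x + c."""
    above = on = below = 0
    for x, y in points:
        predicted = m * x + c
        # A small tolerance guards against floating-point rounding error.
        if abs(y - predicted) < 1e-9:
            on += 1
        elif y > predicted:
            above += 1
        else:
            below += 1
    return above, on, below


# Candidate line from Worked example 0: y = 3.6x + 12.8.
above, on, below = count_sides(points, 3.6, 12.8)
print(above, on, below)  # 3 2 3
```

Three points sit above the line, three below, and two on it, so this candidate line is balanced in the sense of guideline 2. A balanced count alone does not guarantee a good line; it should still follow the slope of the data.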

Once you have two points on the line, find the equation using gradient-intercept form:

Line equation

\(y = mx + c\)

where \(m = \dfrac{y_2 - y_1}{x_2 - x_1}\) is the gradient and \(c\) is the \(y\)-intercept.

Worked example 2 Finding the equation of a line of good fit

A line of good fit passes through the points \((10, 45)\) and \((30, 25)\).

  1. Gradient: \(m = \dfrac{25 - 45}{30 - 10} = \dfrac{-20}{20} = -1\).
  2. Using \((10, 45)\): \(45 = -1 \times 10 + c\), so \(c = 55\).
  3. Equation: \(y = -x + 55\).
  4. Predict the value when \(x = 20\): \(y = -20 + 55 = 35\).
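
The steps in Worked example 2 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `line_through` is hypothetical, and exact fractions are used only so the arithmetic matches the hand calculation.

```python
from fractions import Fraction


def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = Fraction(y2 - y1) / Fraction(x2 - x1)  # gradient = rise / run
    c = y1 - m * x1                            # rearrange y1 = m*x1 + c
    return m, c


m, c = line_through((10, 45), (30, 25))
print(m, c)        # -1 55
print(m * 20 + c)  # 35
```

Calling `line_through((2, 20), (7, 38))` gives a gradient of 18/5 and an intercept of 64/5, which are the values 3.6 and 12.8 found in Worked example 0.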

4. Interpolation vs extrapolation

Interpolation means predicting within the range of the data; extrapolation means predicting outside that range.

Worked example 3 Interpolation and extrapolation

Data covers hours studied from 1 to 8, with the line \(y = 3.6x + 12.8\).

  1. Predict the mark for \(x = 5\) hours: \(y = 3.6 \times 5 + 12.8 = 30.8\). This is interpolation (within range), so the prediction is reliable.
  2. Predict the mark for \(x = 15\) hours: \(y = 3.6 \times 15 + 12.8 = 66.8\). This is extrapolation (outside range), so the prediction is unreliable. The test is out of 50, so 66.8 is impossible.

Key idea: always check whether a prediction falls within the data range before trusting it.
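
The range check in the key idea can be built into the prediction itself. The sketch below assumes the line and data range from Worked example 3; the function name `predict` and its parameters are hypothetical.

```python
def predict(x, m, c, x_min, x_max):
    """Return (predicted y, label) for the line y = m*x + c.

    The label is "interpolation" when x lies within the observed data
    range [x_min, x_max] and "extrapolation" otherwise.
    """
    label = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return m * x + c, label


# Worked example 3: line y = 3.6x + 12.8, data range 1 to 8 hours.
for hours in (5, 15):
    mark, label = predict(hours, 3.6, 12.8, 1, 8)
    print(hours, round(mark, 1), label)
```

The label only reports whether the input lies inside the data range. Whether an interpolated value is trustworthy still depends on how well the line fits the data.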

5. Correlation vs causation

A strong association (correlation) between two variables does not mean one causes the other. There may be a confounding variable: a third variable that influences both.

To establish causation, you need a properly designed experiment with a control group.

Worked example 4 Correlation is not causation

Data shows a strong positive correlation between ice-cream sales and drowning incidents.

Does ice cream cause drowning? No. The confounding variable is temperature: hot weather increases both ice-cream sales and swimming activity, which increases drowning risk. Ice cream and drowning are correlated but not causally linked.


Practice

Fluency

Tier 1: basic skills

    1. Define bivariate data and give an example of two variables you might investigate.
    2. Which variable goes on the horizontal axis: the explanatory variable or the response variable?
    3. A scatterplot shows points rising steeply from left to right with little scatter. Describe the association.
    4. A scatterplot shows points scattered randomly with no pattern. Describe the association.
    5. A line of good fit passes through \((2, 10)\) and \((8, 28)\). Find the gradient.
    6. Using the line from Q5, find the equation and predict \(y\) when \(x = 5\).
    7. Is predicting \(y\) for \(x = 5\) (data range 2 to 8) interpolation or extrapolation?
    8. Is predicting \(y\) for \(x = 12\) (same data range, 2 to 8) interpolation or extrapolation?
    9. State whether each is positive or negative association: (a) height and shoe size, (b) altitude and temperature, (c) practice hours and error count.
    10. True or false: a strong correlation between two variables proves that one causes the other.
Reasoning

Tier 2: mixed practice

    1. Plot the following data on a scatterplot and describe the association:

      x | 2 | 4 | 6 | 8 | 10 | 12
      y | 35 | 30 | 24 | 20 | 14 | 10
    2. Draw a line of good fit for the data in Q1, find its equation, and predict \(y\) when \(x = 7\).

    3. A researcher finds a strong positive correlation between the number of firefighters at a fire and the amount of damage caused. Does this mean firefighters cause damage? Explain.

    4. Data on advertising spend ($'000) and sales ($'000) for 6 months is:

      Advertising | 5 | 10 | 15 | 20 | 25 | 30
      Sales | 40 | 55 | 65 | 80 | 90 | 100

      Find the equation of the line of good fit and predict sales for an advertising spend of $18,000.

    5. Explain why extrapolating the line from Q4 to predict sales for $100,000 in advertising is unreliable.

Reasoning

Tier 3: explain and apply

    1. A scatterplot of study hours vs exam mark shows a strong positive linear association for 0 to 6 hours, but the points flatten out beyond 6 hours. Explain this pattern and discuss the limitations of using a single straight line for the entire data set.
    2. Two scatterplots are shown: (A) shows a strong linear pattern; (B) shows a moderate curved pattern. A student claims that (A) always provides better predictions. Evaluate this claim.
    3. Explain the difference between an observed association, a confounding variable, and a causal relationship. Use a real-world example to illustrate all three concepts.
    4. A line of good fit has the equation \(y = -2.5x + 80\). The data ranges from \(x = 5\) to \(x = 25\). For what values of \(x\) does the line predict negative \(y\) values? Explain why these predictions are meaningless.

Challenge

Reasoning

Harder reasoning

    1. Two students draw different lines of good fit for the same scatterplot. Student A’s line passes through \((5, 20)\) and \((15, 50)\). Student B’s line passes through \((3, 14)\) and \((17, 56)\). Find the gradient and \(y\)-intercept of each line and compare them. If two lines had the same gradient but different \(y\)-intercepts, which would you trust more and why?
    2. A data set of 12 points has a strong positive linear association. One additional point is added far from the trend (an outlier). Describe how this outlier could affect (a) the position of the line of good fit, (b) the strength of the association, and (c) predictions made using the line.
    3. A study finds that countries with more mobile phones per person also have higher life expectancy. A journalist writes “Mobile phones increase life expectancy.” Write a critique of this claim, identifying at least two confounding variables and explaining why a controlled experiment would be needed.
    4. The residual for a data point is defined as \(\text{residual} = \text{observed } y - \text{predicted } y\). For the data \((3, 22), (5, 30), (7, 35), (9, 44)\) and the line \(y = 3x + 12\), calculate each residual. What does the pattern of residuals tell you about the fit?
Answers

Answer key

Attempt the practice first, then check your answers below.

Tier 1

    1. Bivariate data consists of pairs of measurements on two variables for each individual. Example: height (cm) and weight (kg) for each student in a class.
    2. The explanatory (independent) variable goes on the horizontal axis.
    3. Strong, positive, linear association.
    4. No association (no pattern).
    5. Gradient \(= \dfrac{28 - 10}{8 - 2} = \dfrac{18}{6} = 3\).
    6. Using \((2, 10)\): \(10 = 3 \times 2 + c\), so \(c = 4\). Equation: \(y = 3x + 4\). When \(x = 5\): \(y = 3 \times 5 + 4 = 19\).
    7. Interpolation (5 is within the range 2 to 8).
    8. Extrapolation (12 is outside the range 2 to 8).
    9. (a) Positive. (b) Negative. (c) Negative.
    10. False. Correlation does not prove causation.

Tier 2

    1. The scatterplot shows a strong, negative, linear association. As \(x\) increases, \(y\) decreases steadily.
    2. A line of good fit through approximately \((2, 35)\) and \((12, 10)\) gives gradient \(= \dfrac{10 - 35}{12 - 2} = \dfrac{-25}{10} = -2.5\). Using \((2, 35)\): \(35 = -2.5 \times 2 + c\), so \(c = 40\). Equation: \(y = -2.5x + 40\). When \(x = 7\): \(y = -2.5 \times 7 + 40 = -17.5 + 40 = 22.5\).
    3. No, firefighters do not cause damage. The confounding variable is the size of the fire. Larger fires cause more damage and also require more firefighters. The number of firefighters and the damage are both consequences of the fire’s severity.
    4. Line through \((5, 40)\) and \((30, 100)\): gradient \(= \dfrac{100 - 40}{30 - 5} = \dfrac{60}{25} = 2.4\). Using \((5, 40)\): \(40 = 2.4 \times 5 + c\), so \(c = 28\). Equation: \(y = 2.4x + 28\). For \(x = 18\): \(y = 2.4 \times 18 + 28 = 43.2 + 28 = 71.2\). Predicted sales: $71,200.
    5. At \(x = 100\): \(y = 2.4 \times 100 + 28 = 268\), predicting $268,000 in sales. This is extrapolation far beyond the data range (5 to 30). The linear trend may not continue: there could be diminishing returns on advertising, market saturation, or budget constraints. The prediction is unreliable.

Tier 3

    1. The flattening suggests diminishing returns: beyond a certain point, additional study hours produce smaller improvements (perhaps due to fatigue or already knowing the material). A single straight line would overestimate marks at high hours and underestimate them in the middle range. A curve or two separate line segments would better fit the data.
    2. The claim is not always correct. If the true relationship is curved, a straight line in (A) may give poor predictions at the extremes despite appearing strong. Plot (B), if fitted with an appropriate curve, could give better predictions than a straight line forced onto (A). The best model matches the form of the data, not just the apparent strength.
    3. Observed association: data shows that students who eat breakfast tend to score higher on tests. Confounding variable: family income — wealthier families may provide both regular meals and better educational resources. Causal relationship: to establish that breakfast causes higher scores, you would need a controlled experiment where students are randomly assigned to eat or skip breakfast, with other factors held constant. Without this, the association may be driven by the confounding variable.
    4. \(y = 0\) when \(-2.5x + 80 = 0\), so \(x = 32\). For \(x > 32\), the line predicts negative \(y\) values. Since \(x = 32\) is outside the data range (5 to 25), these predictions are extrapolations. Negative values may be physically meaningless (e.g. you cannot have negative sales, negative height, etc.), confirming that extrapolation beyond the data range is unreliable.

Challenge

    1. Student A: gradient \(= \dfrac{50 - 20}{15 - 5} = \dfrac{30}{10} = 3\). Equation: \(y = 3x + 5\). Student B: gradient \(= \dfrac{56 - 14}{17 - 3} = \dfrac{42}{14} = 3\). Equation: \(y = 3x + 5\). Both lines have gradient \(3\) and \(y\)-intercept \(5\), so they are actually the same line. If the \(y\)-intercepts differed, you would trust the line whose reference points are closer to the centre of the data cloud, as it is less influenced by extreme points.
    2. (a) The outlier can “pull” the line of good fit toward it, tilting or shifting the line. (b) The strength of association decreases because the outlier increases the scatter around the line. (c) Predictions near the outlier become less reliable, and the line may give poorer predictions for the rest of the data if it has been pulled off course.
    3. The claim confuses correlation with causation. Confounding variables include: (i) GDP per capita — wealthier countries can afford both more mobile phones and better healthcare, nutrition, and sanitation. (ii) Education levels — higher education leads to both greater technology adoption and healthier lifestyles. A controlled experiment (randomly assigning mobile phones and measuring life expectancy) is impractical and ethically complex. Without controlling for confounders, we cannot conclude that mobile phones increase life expectancy.
    4. Predicted values: \(y(3) = 3 \times 3 + 12 = 21\); \(y(5) = 27\); \(y(7) = 33\); \(y(9) = 39\). Residuals: \(22 - 21 = 1\); \(30 - 27 = 3\); \(35 - 33 = 2\); \(44 - 39 = 5\). All residuals are positive and tend to grow as \(x\) increases (though not at every step), suggesting the line underestimates \(y\) values and that the underestimation is larger for larger \(x\). This may indicate a slight curve (non-linearity) in the data, or that the gradient of the line is slightly too small.
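
The residual calculation in Challenge Q4 can be checked with a short script. This is a sketch using the definition given in the question; the function name `residuals` and the variable `data` are hypothetical.

```python
def residuals(points, m, c):
    """Return observed y minus predicted y for each (x, y) point."""
    return [y - (m * x + c) for x, y in points]


# Data and line from Challenge Q4: y = 3x + 12.
data = [(3, 22), (5, 30), (7, 35), (9, 44)]
print(residuals(data, 3, 12))  # [1, 3, 2, 5]
```

Every residual is positive, so all four points lie above the line, which matches the conclusion that the line underestimates the data.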
