Scatterplots and bivariate data

Bivariate data consists of pairs of measurements on two variables for each individual. Example: height (cm) and weight (kg) for each student in a class.
The explanatory (independent) variable goes on the horizontal axis.
Strong, positive, linear association.
No association (no pattern).
Gradient $= \dfrac{28 - 10}{8 - 2} = \dfrac{18}{6} = 3$.
Using $(2, 10)$: $10 = 3 \times 2 + c$, so $c = 4$. Equation: $y = 3x + 4$. When $x = 5$: $y = 3 \times 5 + 4 = 19$.
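The two-point method above can be checked with a short Python sketch. The helper name `line_through` is hypothetical, and the second point $(8, 28)$ is an assumption read off the gradient calculation rather than stated in the answer.

```python
def line_through(p, q):
    """Return (gradient, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = (y2 - y1) / (x2 - x1)  # rise over run
    c = y1 - m * x1            # rearranged from y1 = m * x1 + c
    return m, c


# (8, 28) is assumed to be the second reference point on the line.
m, c = line_through((2, 10), (8, 28))
prediction = m * 5 + c
print(m, c, prediction)  # 3.0 4.0 19.0
```

Substituting either reference point gives the same intercept, so using $(8, 28)$ in place of $(2, 10)$ is a quick way to check the working.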
Interpolation (5 is within the range 2 to 8).
Extrapolation (12 is outside the range 2 to 8).
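The distinction depends only on whether the $x$-value lies inside the range of the data. A minimal sketch, assuming the data range is 2 to 8 as in the answers above (the function name `classify_prediction` is hypothetical):

```python
def classify_prediction(x, x_min, x_max):
    """Label a prediction by whether x lies inside the observed data range."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


print(classify_prediction(5, 2, 8))   # interpolation
print(classify_prediction(12, 2, 8))  # extrapolation
```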
(a) Positive. (b) Negative. (c) Negative.
False. Correlation does not prove causation.

The scatterplot shows a strong, negative, linear association. As $x$ increases, $y$ decreases steadily.
A line of good fit through approximately $(2, 35)$ and $(12, 10)$ gives gradient $= \dfrac{10 - 35}{12 - 2} = \dfrac{-25}{10} = -2.5$. Using $(2, 35)$: $35 = -2.5 \times 2 + c$, so $c = 40$. Equation: $y = -2.5x + 40$. When $x = 7$: $y = -2.5 \times 7 + 40 = -17.5 + 40 = 22.5$.
No, firefighters do not cause damage. The confounding variable is the size of the fire. Larger fires cause more damage and also require more firefighters. The number of firefighters and the damage are both consequences of the fire’s severity.
Line through $(5, 40)$ and $(30, 100)$: gradient $= \dfrac{100 - 40}{30 - 5} = \dfrac{60}{25} = 2.4$. Using $(5, 40)$: $40 = 2.4 \times 5 + c$, so $c = 28$. Equation: $y = 2.4x + 28$. For $x = 18$: $y = 2.4 \times 18 + 28 = 43.2 + 28 = 71.2$. Predicted sales: \$71,200.
At $x = 100$: $y = 2.4 \times 100 + 28 = 268$, predicting \$268,000 in sales. This is extrapolation far beyond the data range ($5$ to $30$). The linear trend may not continue: there could be diminishing returns on advertising, market saturation, or budget constraints. The prediction is unreliable.

The flattening suggests diminishing returns: beyond a certain point, additional study hours produce smaller improvements (perhaps due to fatigue or already knowing the material). A single straight line would overestimate marks at high hours and underestimate them in the middle range. A curve or two separate line segments would better fit the data.
The claim is not always correct. If the true relationship is curved, a straight line in (A) may give poor predictions at the extremes despite appearing strong. Plot (B), if fitted with an appropriate curve, could give better predictions than a straight line forced onto (A). The best model matches the form of the data, not just the apparent strength.
Observed association: data shows that students who eat breakfast tend to score higher on tests. Confounding variable: family income, since wealthier families may provide both regular meals and better educational resources. Causal relationship: to establish that breakfast causes higher scores, you would need a controlled experiment where students are randomly assigned to eat or skip breakfast, with other factors held constant. Without this, the association may be driven by the confounding variable.
$y = 0$ when $-2.5x + 80 = 0$, so $x = 32$. For $x > 32$, the line predicts negative $y$ values. Since $x = 32$ already lies beyond the data range ($5$ to $25$), all of these predictions are extrapolations. Negative values may be physically meaningless (e.g. you cannot have negative sales or negative height), confirming that extrapolation beyond the data range is unreliable.
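The zero crossing can be verified numerically. This is a sketch only: `zero_crossing` is a hypothetical helper, and the line $y = -2.5x + 80$ and the data range 5 to 25 are taken from the answer above.

```python
def zero_crossing(m, c):
    """Return the x-value where the line y = m*x + c meets y = 0."""
    if m == 0:
        raise ValueError("a horizontal line has no single zero crossing")
    return -c / m


x0 = zero_crossing(-2.5, 80)
data_min, data_max = 5, 25  # data range quoted in the answer

print(x0)                                  # 32.0
print(data_min <= x0 <= data_max)          # False: the crossing is an extrapolation
print(-2.5 * 33 + 80)                      # -2.5, a negative prediction just past x = 32
```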

Student A: gradient $= \dfrac{50 - 20}{15 - 5} = \dfrac{30}{10} = 3$. Using $(5, 20)$: $20 = 3 \times 5 + c$, so $c = 5$. Equation: $y = 3x + 5$. Student B: gradient $= \dfrac{56 - 14}{17 - 3} = \dfrac{42}{14} = 3$. Using $(3, 14)$: $14 = 3 \times 3 + c$, so $c = 5$. Equation: $y = 3x + 5$. Both lines have gradient $3$ and $y$-intercept $5$, so they are actually the same line. If the $y$-intercepts differed, you would trust the line whose reference points are closer to the centre of the data cloud, as it is less influenced by extreme points.
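One way to confirm that the two students found the same line is to compute both equations exactly. In this sketch the helper `line_through` is hypothetical, and the reference points $(5, 20)$, $(15, 50)$, $(3, 14)$ and $(17, 56)$ are assumptions read off the gradient calculations.

```python
from fractions import Fraction


def line_through(p, q):
    """Return (gradient, intercept) as exact fractions for the line through p and q."""
    (x1, y1), (x2, y2) = p, q
    m = Fraction(y2 - y1, x2 - x1)
    c = y1 - m * x1
    return m, c


# Reference points assumed from each student's gradient working.
student_a = line_through((5, 20), (15, 50))
student_b = line_through((3, 14), (17, 56))

print(student_a == student_b)  # True: same gradient and same y-intercept
```

Exact fractions avoid rounding problems when comparing gradients that are not whole numbers.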
(a) The outlier can “pull” the line of good fit toward it, tilting or shifting the line. (b) The strength of association decreases because the outlier increases the scatter around the line. (c) Predictions near the outlier become less reliable, and the line may give poorer predictions for the rest of the data if it has been pulled off course.
The claim confuses correlation with causation. Confounding variables include: (i) GDP per capita: wealthier countries can afford both more mobile phones and better healthcare, nutrition, and sanitation. (ii) Education levels: higher education leads to both greater technology adoption and healthier lifestyles. A controlled experiment (randomly assigning mobile phones and measuring life expectancy) is impractical and ethically complex. Without controlling for confounders, we cannot conclude that mobile phones increase life expectancy.
Predicted values: $y(3) = 3 \times 3 + 12 = 21$; $y(5) = 27$; $y(7) = 33$; $y(9) = 39$. Residuals: $(22 - 21) = 1$; $(30 - 27) = 3$; $(35 - 33) = 2$; $(44 - 39) = 5$. All residuals are positive and tend to grow with $x$ (though not at every step: $1, 3, 2, 5$), suggesting the line underestimates $y$ values, and the underestimation is generally larger for larger $x$. This may indicate a slight curve (non-linearity) in the data, or that the gradient of the line is slightly too small.
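The residuals (observed minus predicted) can be recomputed with a short sketch. The variable names are hypothetical; the data values and the line $y = 3x + 12$ are taken from the working above.

```python
xs = [3, 5, 7, 9]
observed = [22, 30, 35, 44]


def predict(x):
    """Predicted y from the fitted line y = 3x + 12."""
    return 3 * x + 12


# Residual = observed value minus predicted value.
residuals = [y - predict(x) for x, y in zip(xs, observed)]

all_positive = all(r > 0 for r in residuals)
strictly_increasing = all(a < b for a, b in zip(residuals, residuals[1:]))

print(residuals)            # [1, 3, 2, 5]
print(all_positive)         # True: the line sits below every data point
print(strictly_increasing)  # False: the residual dips from 3 to 2 at x = 7
```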

Tier 1

Tier 2

Tier 3

Challenge