What you will learn
By the end of this lesson, you should be able to:
- explain what bivariate data is and why we use scatterplots,
- construct scatterplots from paired data,
- describe the association between two variables: strength, direction, and form,
- draw a line of good fit by eye and use it for predictions,
- distinguish between interpolation and extrapolation (and their limitations),
- explain why correlation does not imply causation.
A teacher records the hours studied and test marks (out of ) for students:
| Hours | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Mark | 15 | 20 | 22 | 28 | 30 | 35 | 38 | 42 |
- Plot each (hours, mark) pair as a point: $(1, 15)$, $(2, 20)$, $\ldots$, $(8, 42)$.
- The points rise from left to right: positive association.
- They lie close to a straight line: strong, linear association.
- A line of good fit passes through approximately and .
- Gradient . Equation: , so .
Key idea: we can use the line to predict marks for a given number of hours — but only within the data range.
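As a rough illustration of the plotting step in this example, the following Python sketch draws the eight (hours, mark) pairs from the table as a text scatterplot. The function name `text_scatterplot` and the band width `y_step` are assumptions made for this sketch, and it assumes whole-number values on the horizontal axis.

```python
# Data from the table: hours studied and test mark for each student.
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [15, 20, 22, 28, 30, 35, 38, 42]


def text_scatterplot(xs, ys, y_step=5):
    """Return a rough text scatterplot.

    Each row covers a band of y values that is y_step wide, and each
    column is one whole-number x value. A '*' marks a data point.
    """
    points = list(zip(xs, ys))
    top = (max(ys) // y_step) * y_step      # start of the highest band
    bottom = (min(ys) // y_step) * y_step   # start of the lowest band
    columns = range(min(xs), max(xs) + 1)

    rows = []
    for band in range(top, bottom - 1, -y_step):
        cells = [
            "*" if any(px == x and band <= py < band + y_step
                       for px, py in points) else "."
            for x in columns
        ]
        rows.append(f"{band:>3} | " + " ".join(cells))

    # Horizontal axis and its labels.
    rows.append("    +" + "-" * (2 * len(columns)))
    rows.append("      " + " ".join(str(x) for x in columns))
    return "\n".join(rows)


print(text_scatterplot(hours, marks))
```

Each `*` marks one student. The stars climb from the bottom-left to the top-right of the grid, which is the positive association described in the example.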
1. What is bivariate data?
Bivariate data consists of pairs of measurements on two variables for each individual or observation. The explanatory variable (independent) is plotted on the horizontal axis, and the response variable (dependent) is plotted on the vertical axis.
2. Describing association
When describing a scatterplot, address three features:
| Feature | Options |
|---|---|
| Direction | Positive (upward trend) or negative (downward trend) |
| Strength | Strong (points close to a line), moderate, or weak (points widely scattered) |
| Form | Linear (straight-line pattern) or non-linear (curved pattern) |
If there is no discernible pattern, we say there is no association.
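This lesson judges direction and strength by eye. As an optional numerical cross-check, the sketch below computes Pearson's correlation coefficient `r`, a standard measure that the lesson does not otherwise use, for a small set of hours-studied and test-mark data. The function names and the cut-offs 0.8 and 0.4 are illustrative assumptions rather than agreed standards. Note that `r` only measures linear association, so a curved form still has to be spotted on the plot.

```python
from math import sqrt

hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [15, 20, 22, 28, 30, 35, 38, 42]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe(xs, ys, strong=0.8, weak=0.4):
    """Rough verbal description of a linear association.

    The cut-offs `strong` and `weak` are assumptions for this sketch.
    """
    r = pearson_r(xs, ys)
    if abs(r) < weak:
        return "weak or no linear association"
    direction = "positive" if r > 0 else "negative"
    strength = "strong" if abs(r) >= strong else "moderate"
    return f"{strength}, {direction} linear association"


print(describe(hours, marks))
```

A value of `r` near 1 or -1 corresponds to points lying close to a straight line, and the sign of `r` matches the direction of the trend.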
A scatterplot of car age (years) vs resale value ($) shows points falling from left to right. The points cluster tightly around a curve that flattens out for older cars.
Description: there is a strong, negative, non-linear association between car age and resale value. As car age increases, resale value decreases, but the rate of decrease slows for older cars.
3. Line of good fit by eye
A line of good fit (or trend line) is a straight line drawn through the data that best represents the overall pattern. When drawing by eye:
- The line should pass through (or close to) the middle of the data cloud.
- Roughly equal numbers of points should be above and below the line.
- The line should follow the direction and slope of the data.
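The "roughly equal numbers above and below" guideline can be checked by counting. The sketch below is a minimal illustration using a small set of hours and marks data and two hypothetical candidate lines. Both lines, and the function name `balance_about_line`, are assumptions made for this example and are not necessarily the line drawn in the lesson.

```python
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [15, 20, 22, 28, 30, 35, 38, 42]


def balance_about_line(xs, ys, gradient, intercept):
    """Count the data points above, below, and exactly on a line."""
    above = below = on = 0
    for x, y in zip(xs, ys):
        predicted = gradient * x + intercept
        if y > predicted:
            above += 1
        elif y < predicted:
            below += 1
        else:
            on += 1
    return above, below, on


# Hypothetical candidate 1: y = 4x + 10.5
print(balance_about_line(hours, marks, 4, 10.5))

# Hypothetical candidate 2 (too steep): y = 6x + 5
print(balance_about_line(hours, marks, 6, 5))
```

The first candidate splits the points evenly, while the second leaves most points below the line, which suggests it does not follow the slope of the data. An even split alone is not enough, because a badly tilted line through the middle of the cloud can also balance the counts, so the direction guideline still matters.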
Once you have two points on the line, find the equation using gradient-intercept form, $y = mx + c$, where $m$ is the gradient and $c$ is the $y$-intercept.
A line of good fit passes through the points and .
- Gradient: .
- Using : , so .
- Equation: .
- Predict the value when : .
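The same steps (gradient, then intercept, then prediction) can be sketched in Python. The two points below, $(2, 7)$ and $(6, 15)$, are hypothetical values chosen for illustration and are not the points from the worked example. The function names `line_through` and `predict` are likewise assumptions.

```python
def line_through(p1, p2):
    """Return (gradient, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    gradient = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - gradient * x1     # rearrange y = mx + c for c
    return gradient, intercept


def predict(gradient, intercept, x):
    """Value on the line y = mx + c at the given x."""
    return gradient * x + intercept


# Hypothetical points on a line of good fit.
m, c = line_through((2, 7), (6, 15))
print(m, c)               # gradient and intercept
print(predict(m, c, 4))   # predicted value at x = 4
```

Here the gradient is $(15 - 7) / (6 - 2) = 2$ and the intercept is $7 - 2 \times 2 = 3$, so the line is $y = 2x + 3$ and the prediction at $x = 4$ is $11$.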
4. Interpolation vs extrapolation
- Interpolation: predicting within the range of the data. This is generally reliable.
- Extrapolation: predicting outside the range of the data. This is risky because the trend may not continue.
Data covers hours studied from to , with the line .
- Predict the mark for hours: . This is interpolation (within range) — reliable.
- Predict the mark for hours: . This is extrapolation (outside range) — unreliable. The test is out of , so is impossible.
Key idea: always check whether a prediction falls within the data range before trusting it.
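The range check can be written as a one-line test. The sketch below assumes data with hours studied from 1 to 8, and the query values 4.5 and 20 are hypothetical. The function name `prediction_type` is an assumption for this example.

```python
def prediction_type(x, observed_xs):
    """Classify a prediction at x as interpolation or extrapolation."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x <= highest:
        return "interpolation"
    return "extrapolation"


hours = [1, 2, 3, 4, 5, 6, 7, 8]

print(prediction_type(4.5, hours))  # inside the range 1 to 8
print(prediction_type(20, hours))   # outside the range 1 to 8
```

This sketch treats the endpoints of the data range as interpolation. That is a design choice; predictions right at the edge of the data deserve more caution than ones near the middle.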
5. Correlation vs causation
A strong association (correlation) between two variables does not mean one causes the other. There may be:
- a confounding variable (a third variable that influences both),
- reverse causation (the direction of influence is opposite to what was assumed),
- coincidence (the association is purely by chance).
To establish causation, you need a properly designed experiment with a control group.
Data shows a strong positive correlation between ice-cream sales and drowning incidents.
Does ice cream cause drowning? No. The confounding variable is temperature: hot weather increases both ice-cream sales and swimming activity, which increases drowning risk. Ice cream and drowning are correlated but not causally linked.
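A small simulation can make the confounding idea concrete. The sketch below uses entirely made-up numbers (the temperature range, the coefficients, and the noise levels are all assumptions, not real data). Temperature drives both simulated quantities, and neither one appears in the formula for the other, yet they still end up strongly correlated.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated (invented) daily values: temperature is the confounder.
temperature = [rng.uniform(15, 35) for _ in range(200)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

# Correlation over all days.
r_all = pearson_r(ice_cream, drownings)

# Correlation using only days with similar temperatures (23 to 27 degrees).
similar = [(i, d) for t, i, d in zip(temperature, ice_cream, drownings)
           if 23 <= t <= 27]
r_similar = pearson_r([i for i, _ in similar], [d for _, d in similar])

print(round(r_all, 2), round(r_similar, 2))
```

Comparing the two printed values shows the correlation shrinking once temperature is held roughly constant, which is what you would expect if temperature, rather than ice cream, explains the pattern.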
Practice
Tier 1: basic skills
- Define bivariate data and give an example of two variables you might investigate.
- Which variable goes on the horizontal axis: the explanatory variable or the response variable?
- A scatterplot shows points rising steeply from left to right with little scatter. Describe the association.
- A scatterplot shows points scattered randomly with no pattern. Describe the association.
- A line of good fit passes through and . Find the gradient.
- Using the line from Q5, find the equation and predict when .
- Is predicting for (data range —) interpolation or extrapolation?
- Is predicting for interpolation or extrapolation?
- State whether each is positive or negative association: (a) height and shoe size, (b) altitude and temperature, (c) practice hours and error count.
- True or false: a strong correlation between two variables proves that one causes the other.
Tier 2: mixed practice
- Plot the following data on a scatterplot and describe the association:

  | x | 2 | 4 | 6 | 8 | 10 | 12 |
  |---|---|---|---|---|---|---|
  | y | 35 | 30 | 24 | 20 | 14 | 10 |

- Draw a line of good fit for the data in Q1, find its equation, and predict when .
- A researcher finds a strong positive correlation between the number of firefighters at a fire and the amount of damage caused. Does this mean firefighters cause damage? Explain.
- Data on advertising spend ($'000) and sales ($'000) for 6 months is:

  | Advertising | 5 | 10 | 15 | 20 | 25 | 30 |
  |---|---|---|---|---|---|---|
  | Sales | 40 | 55 | 65 | 80 | 90 | 100 |

  Find the equation of the line of good fit and predict sales for an advertising spend of $18,000.
- Explain why extrapolating the line from Q4 to predict sales for $100,000 in advertising is unreliable.
Tier 3: explain and apply
- A scatterplot of study hours vs exam mark shows a strong positive linear association for — hours, but the points flatten out beyond hours. Explain this pattern and discuss the limitations of using a single straight line for the entire data set.
- Two scatterplots are shown: (A) shows a strong linear pattern; (B) shows a moderate curved pattern. A student claims that (A) always provides better predictions. Evaluate this claim.
- Explain the difference between an observed association, a confounding variable, and a causal relationship. Use a real-world example to illustrate all three concepts.
- A line of good fit has the equation . The data ranges from to . For what values of does the line predict negative values? Explain why these predictions are meaningless.
Challenge
Harder reasoning
- Two students draw different lines of good fit for the same scatterplot. Student A’s line passes through and . Student B’s line passes through and . Show that both lines have the same gradient but different $y$-intercepts. Which line would you trust more, and why?
- A data set of points has a strong positive linear association. One additional point is added far from the trend (an outlier). Describe how this outlier could affect (a) the position of the line of good fit, (b) the strength of the association, and (c) predictions made using the line.
- A study finds that countries with more mobile phones per person also have higher life expectancy. A journalist writes “Mobile phones increase life expectancy.” Write a critique of this claim, identifying at least two confounding variables and explaining why a controlled experiment would be needed.
- The residual for a data point is defined as . For the data and the line , calculate each residual. What does the pattern of residuals tell you about the fit?