Scientific inquiry: variables, validity, argument

What you will learn

write an investigable question and a testable hypothesis,
identify independent, dependent, and controlled variables,
distinguish validity, reliability, and accuracy,
recognise random and systematic errors and sources of bias,
evaluate experimental design and construct a scientific argument from data.

Worked example 0 Real-world example: does caffeine improve reaction time?

A student plans to test whether caffeine improves reaction time.

Question: Does consuming caffeine reduce reaction time in 14- to 15-year-olds?
Hypothesis: If caffeine raises alertness, then reaction time will be shorter after $100$ mg caffeine than after no caffeine.
Independent variable: caffeine dose ($0$ mg, $100$ mg).
Dependent variable: reaction time (ms) on a standard online test.
Controlled variables: time of day, sleep before test, noise, test type, familiarity with the test.
Reliability: repeat the test several times per condition and average the results.
Validity: the reaction-time test must measure what we think it measures, and the caffeine must have time to enter the bloodstream (wait about 20 min after the dose before testing).

Key idea: one thing changes (IV), one thing is measured (DV), everything else is held constant. That is a fair test.

1. Questions and hypotheses

An investigable question is specific and answerable by an experiment, not a broad opinion or a value judgement.

Weak: “Is exercise good?”
Better: “Does 10 minutes of jogging raise heart rate more than 10 minutes of walking?”

A hypothesis is a testable prediction, usually in “if … then …” form, that states an expected direction or relationship.

“If the surface area of a solid reactant increases, then the reaction rate will increase, because more particles are exposed to collisions.”

Good hypotheses are:

Specific,
Falsifiable (could be shown wrong by evidence),
Linked to existing theory (“because …”).

2. Variables

Variable type	What it is	Example (plant growth test)
Independent (IV)	what you change	amount of sunlight per day
Dependent (DV)	what you measure	height after 2 weeks
Controlled	what you keep the same	soil, water, plant species, pot size, room temperature

A fair test changes only the IV and measures the DV, with everything else controlled. Only then can you reasonably attribute a change in the DV to a change in the IV.

3. Validity, reliability, accuracy

These three are often confused.

Term	What it asks	How to improve
Validity	Does the experiment actually measure what it claims to?	Control confounding variables; use a valid method
Reliability	Does repeating give similar results?	Take multiple readings; increase sample size
Accuracy	How close is a measurement to the true value?	Use calibrated instruments; minimise systematic error

4. Errors and bias

Random error: unpredictable fluctuations around the true value. Sources: reading scales, slight variation in timing.

Reduced by: repeating measurements and averaging.

Systematic error: a consistent bias in one direction. Sources: mis-calibrated instruments, parallax, incorrect zero.

Reduced by: calibrating, checking zero, using better equipment.
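The difference between the two error types can be checked with a short simulation. The sketch below is illustrative only: the true length, the size of the random scatter, and the $2$ cm offset are made-up values, and `simulate_readings` is a hypothetical helper, not a standard method.

```python
import random
import statistics


def simulate_readings(true_value, n, random_sd, offset, seed):
    """Return n simulated readings: true value + fixed offset + random scatter."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [true_value + offset + rng.gauss(0, random_sd) for _ in range(n)]


TRUE_LENGTH_M = 1.230  # assumed "true" value (hypothetical)
RANDOM_SD_M = 0.010    # assumed size of the random scatter (hypothetical)
OFFSET_M = 0.020       # assumed systematic offset, e.g. a bad zero (hypothetical)

# Random error only: the mean of many readings settles near the true value.
random_only = simulate_readings(TRUE_LENGTH_M, 500, RANDOM_SD_M, 0.0, seed=1)
# Random plus systematic error: the mean settles near true value + offset.
with_offset = simulate_readings(TRUE_LENGTH_M, 500, RANDOM_SD_M, OFFSET_M, seed=2)

mean_random_only = statistics.mean(random_only)
mean_with_offset = statistics.mean(with_offset)

print(f"random only: mean = {mean_random_only:.3f} m")
print(f"with offset: mean = {mean_with_offset:.3f} m")
```

Averaging removes most of the scatter in both cases, but the offset survives in the second mean. Only calibrating or checking the zero removes it.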

Bias: a skew in sampling or interpretation that makes results unrepresentative.

Sampling bias: surveying only your friends.
Observer bias: expecting a result and seeing it.
Reduced by: random sampling, blinding, pre-registering the hypothesis.

Worked example 1 Spotting an error

Five students measure the length of a bench with a ruler: $1.23$ m, $1.24$ m, $1.22$ m, $1.25$ m, $1.23$ m. Another student reads $1.10$ m. Which is likely a random error and which is a larger problem?

The five readings spread by $3$ cm — likely random error from reading the scale.
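The $1.10$ m reading is an outlier, about $13$ cm below the others and far outside their spread. It is most likely a mistake in method (for example, placing the ruler's zero about $13$ cm in from the end of the bench) or a misread scale; if that student measured the same way every time, it would act as a systematic error. It should be investigated, not averaged in blindly.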
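The arithmetic behind this judgement can be sketched in a few lines. The readings are the ones given in the example; the variable names are only illustrative.

```python
import statistics

readings_m = [1.23, 1.24, 1.22, 1.25, 1.23]  # the five consistent readings
suspect_m = 1.10                              # the sixth student's reading

mean_m = statistics.mean(readings_m)
spread_m = max(readings_m) - min(readings_m)  # highest minus lowest
gap_m = mean_m - suspect_m                    # how far the suspect reading sits below the mean

print(f"mean = {mean_m:.3f} m, spread = {spread_m * 100:.0f} cm")
print(f"suspect reading is {gap_m * 100:.1f} cm below the mean")
```

The suspect reading sits more than four times the whole spread away from the mean, which is why it is treated as something to investigate and not as ordinary scatter.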

5. Designing a good experiment

A good experimental design includes:

Clear question and hypothesis with reasoning.
Identified IV, DV, controlled variables.
Control group (where appropriate) or baseline measurement.
Replication: repeat measurements or multiple trials.
Suitable range of values for the IV.
Calibrated and appropriate instruments.
Risk assessment — ethical and safety considerations.
Plan for recording data (tables) and processing (means, graphs).

Worked example 2 Improving a design

A student tests whether music speeds up homework. They do maths for 30 min with music and 30 min without. Critique the design.

Only one trial each — no replication.
No control of homework type (could be easier/harder).
Only one student — no sample; results may not generalise.
Time of day, fatigue not controlled.
DV (“faster”) is vague — should be problems per minute or accuracy.

Better: 20 students, randomised order of music/no-music, same standardised test, measured time and accuracy, repeated multiple times, results averaged.
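Randomising the order guards against order effects such as fatigue or practice. A minimal sketch of one way to do it, assuming 20 students with made-up labels `S01` to `S20`; `assign_orders` is a hypothetical helper written for this example.

```python
import random


def assign_orders(students, seed=None):
    """Randomly give half the students music first and half no-music first."""
    rng = random.Random(seed)
    shuffled = list(students)  # copy so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    orders = {}
    for student in shuffled[:half]:
        orders[student] = ("music", "no music")
    for student in shuffled[half:]:
        orders[student] = ("no music", "music")
    return orders


students = [f"S{i:02d}" for i in range(1, 21)]  # hypothetical labels
orders = assign_orders(students, seed=42)       # fixed seed so the plan can be reproduced

music_first = [s for s, order in orders.items() if order[0] == "music"]
print(len(music_first), "students do the music condition first")
```

Splitting the group in half keeps the two orders balanced, so any effect of going first or second affects both conditions equally.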

6. Analysing data and drawing conclusions

Tabulate raw data with clear units and headings.
Summarise with a mean, and sometimes range or standard deviation to show spread.
Plot IV on the x-axis, DV on the y-axis.
Look for patterns, trends, and outliers.
Check that the conclusion actually follows from the data.

A scientific argument has three parts:

Claim: a statement that answers the question.
Evidence: data or observations that support the claim.
Reasoning: a link explaining why the evidence supports the claim, using scientific principles.

Worked example 3 Writing a conclusion

Data from a plant-light experiment: plants in 12 h light grew $8.4$ cm on average; in 6 h light grew $3.2$ cm; in 2 h light grew $0.9$ cm. Write a conclusion.

Claim: Increasing daily sunlight increased plant growth over two weeks.

Evidence: Mean growth rose from $0.9$ cm (2 h) to $3.2$ cm (6 h) to $8.4$ cm (12 h), a clear upward trend.

Reasoning: Plants use light for photosynthesis to make glucose. More hours of light allow more photosynthesis and more material for growth, consistent with the observed trend.

Note the conclusion should not over-reach: “plants” here means the species tested, “sunlight” means the lamp used, and the range tested was 2-12 h.
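The trend in the evidence can be checked directly from the three means. This sketch uses only the figures given in the example; the variable names are illustrative.

```python
light_hours = [2, 6, 12]
mean_growth_cm = [0.9, 3.2, 8.4]

pairs = list(zip(light_hours, mean_growth_cm))

# True if growth rises at every step up in light.
is_increasing = all(g2 > g1 for (_, g1), (_, g2) in zip(pairs, pairs[1:]))

# Extra growth per extra hour of light between neighbouring conditions.
rates_cm_per_h = [
    (g2 - g1) / (h2 - h1)
    for (h1, g1), (h2, g2) in zip(pairs, pairs[1:])
]

print(is_increasing)
print([round(rate, 2) for rate in rates_cm_per_h])
```

The extra growth per extra hour differs between the two intervals (about $0.58$ and $0.87$ cm per hour), so three light levels support "more light, more growth" but not a specific straight-line rule.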

7. Evaluating a claim

When judging a scientific claim (or a news story), ask:

What was the sample size? Was the sample representative?
Was there a control or baseline?
Were confounding variables controlled?
Has the result been replicated by others?
Who funded or conducted the work? Could bias influence conclusions?
Does the claim go beyond what the data actually show?

Practice: Year 9

Fluency

Question, hypothesis, variables

Write an investigable question about how temperature affects the dissolving rate of salt.
Write a hypothesis for the above question in “if … then … because …” form.
For a test of “does fertiliser amount change tomato yield?”: identify the IV, DV, and three controlled variables.
Explain the difference between an IV and a DV.
State what a “control group” is and give an example.

Reasoning

Validity, reliability, accuracy

Define validity, reliability, and accuracy in your own words.
A thermometer reads $102^{\circ}\text{C}$ in boiling water at sea level (true value $100^{\circ}\text{C}$). Classify the error.
A stopwatch gives times of 12.40 s, 12.41 s, 12.39 s, 12.42 s. Classify: reliable? accurate? valid?
Give an example of an experiment that is reliable but not valid.
Why does repeating measurements improve reliability but not necessarily accuracy?

Problem solving

Designing and evaluating

A student investigates whether a ball dropped from higher bounces more. Design a plan: IV, DV, three controlled variables, what data to collect, and how to analyse.
Critique this design: “I tested a new fertiliser on my tomato plant. It grew taller than my neighbour’s tomato. Therefore the fertiliser works.” List three issues.
A company funds a study concluding its sugary drink is “not linked to weight gain”. Suggest two potential sources of bias and how to address them.
A class of 30 students has 28 results between 2.0 and 2.5 for an experiment. Two students report results of 8.7. Discuss whether to include or exclude the outliers and how to decide.

Reasoning

Arguments from data

A graph shows ice-cream sales and drowning rates rising together through summer. A headline reads “Ice cream causes drownings.” Evaluate this causal claim (hint: think about a common cause).
Data: reaction time (ms) after caffeine dose (mg): 0 -> 280, 50 -> 260, 100 -> 250, 150 -> 245, 200 -> 260. Describe the pattern and the most plausible interpretation.
Write a three-part argument (claim, evidence, reasoning) for: “an LED bulb is more efficient than an incandescent bulb”, using typical figures from your topic knowledge.

Challenge

Reasoning

Harder reasoning

A medical trial uses “double-blind” design: neither patient nor doctor knows who got the drug or placebo. Explain why this controls bias, and what would go wrong if either side knew.
Two studies disagree about the effect of a new diet. Study A: $n = 15$, $12$ weeks, self-reported weight. Study B: $n = 500$, $6$ months, weighed by researchers. Using the ideas of validity, reliability, and sample size, argue which result deserves more weight.
A student claims their experiment “proves” their hypothesis. Explain why science never “proves” a hypothesis, only supports or falsifies it — and why that makes science more trustworthy, not less.
A graph of test scores vs hours studied shows scatter but a clear upward trend. Write a balanced conclusion that distinguishes correlation from causation and identifies at least one confounding variable.

Answers

Answer key

Attempt the practice first, then check your answers below.

Year 9 answers

Fluency

Question, hypothesis, variables

E.g. “Does increasing water temperature reduce the time for $5$ g of salt to dissolve in $100$ mL of water?”
“If water temperature is increased, then the time for salt to dissolve will decrease, because water particles move faster at higher temperature and collide with the salt more often.”
IV: mass of fertiliser per plant. DV: tomato yield (mass or count of fruit). Controlled: variety of tomato, size of pot, amount of water, light exposure, soil type, duration of experiment.
The IV is the variable the experimenter changes; the DV is what is measured and is expected to respond to changes in the IV.
A control group is a comparison group that does not receive the treatment/change, showing what happens without the IV. Example: placebo group in a drug trial.

Reasoning

Validity, reliability, accuracy

Validity: the experiment measures what it claims to measure. Reliability: repeated measurements give consistent results. Accuracy: measurements are close to the true value.
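Systematic error: the reading is off by a consistent $+2^{\circ}\text{C}$.
Reliable (the readings agree to within $0.03$ s). Accuracy cannot be judged without knowing the true time; the readings are accurate only if the true time is near $12.40$ s. Validity depends on whether the timing actually measures the quantity the experiment is about.
E.g. estimating how fit people are by measuring their height with a tape measure: repeated readings agree (reliable), but height does not measure fitness, so the experiment is not valid. Note that a bathroom scale that always reads $2$ kg too low is a different fault: its readings are reliable but inaccurate.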
Repeating averages out random fluctuations, improving reliability. It does not fix systematic errors, which push every reading the same way.
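The numbers behind the thermometer and stopwatch answers can be worked out as follows. The sketch uses only the values given in the questions; "relative spread" is one possible way to express consistency, chosen here for illustration.

```python
import statistics

# Stopwatch question: how tightly do the readings agree?
times_s = [12.40, 12.41, 12.39, 12.42]
mean_s = statistics.mean(times_s)
range_s = max(times_s) - min(times_s)
relative_spread_percent = range_s / mean_s * 100

# Thermometer question: how far is the reading from the known true value?
thermometer_reading_c = 102.0
true_boiling_c = 100.0
offset_c = thermometer_reading_c - true_boiling_c

print(f"mean = {mean_s:.3f} s, range = {range_s:.2f} s "
      f"({relative_spread_percent:.2f}% of the mean)")
print(f"thermometer offset = {offset_c:+.1f} degrees C")
```

The spread describes reliability and needs no true value. The offset describes accuracy and can only be calculated when a true value is known, as it is for boiling water at sea level.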

Problem solving

Designing and evaluating

IV: drop height (e.g. 25, 50, 75, 100, 125 cm). DV: bounce height (cm) from the floor to top of first bounce. Controlled: same ball, same surface, same ball release technique (no push), same temperature, same measurer. Data: measure bounce height 3 times per drop height; average. Analysis: plot bounce height (y) vs drop height (x); look for a linear trend and comment on outliers.
Issues (any three): (i) $n = 1$, so no replication; (ii) no control: the plants and growing conditions differ, so there are uncontrolled confounds; (iii) a single outcome (taller) doesn’t show the fertiliser is responsible; (iv) no randomisation; (v) no measure of variability.
Bias sources: (i) selective reporting of favourable results; (ii) study design choices favouring the sponsor (short duration, specific group). Address by independent replication, pre-registering the study, and full public data access.
Investigate first: were the two outliers from a procedural mistake (e.g. different method)? If yes, exclude and state this. If there’s no mistake, keep them but report them — they may reflect real variation. Use a consistent rule (e.g. outlier test) rather than discarding data to make the result look cleaner.
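For the class-results question, one common "consistent rule" is the 1.5 × IQR rule, sketched below. The question gives only the range of the 28 typical results, so the individual values here are made-up stand-ins; only the two results of 8.7 come from the question.

```python
import statistics

# Made-up stand-in data: 28 values between 2.0 and 2.5 (hypothetical),
# plus the two reported results of 8.7 from the question.
typical = [round(2.0 + 0.1 * (i % 6), 1) for i in range(28)]
results = typical + [8.7, 8.7]

# Quartiles and interquartile range (IQR).
q1, _, q3 = statistics.quantiles(results, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Flag anything outside the fences for investigation.
flagged = [x for x in results if x < low_fence or x > high_fence]
kept = [x for x in results if low_fence <= x <= high_fence]

print("flagged:", flagged)
print(f"mean with all results:    {statistics.mean(results):.2f}")
print(f"mean without the flagged: {statistics.mean(kept):.2f}")
```

The rule only flags values for a closer look. Whether to exclude them still depends on finding a procedural reason, and reporting both means shows the reader how much the decision matters.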

Reasoning

Arguments from data

Correlation does not imply causation. Both ice-cream sales and drownings rise in summer because of higher temperatures and more swimming; the common cause is hot weather, not ice cream.
Reaction time falls from 280 ms (0 mg) to 245 ms (150 mg), showing caffeine shortens reaction time up to a point. At 200 mg it rises again (260 ms), suggesting a “too much” effect (jitteriness, over-arousal). Plausible interpretation: moderate doses improve alertness; high doses may impair.
Claim: LED bulbs are more efficient than incandescents. Evidence: a typical incandescent emits only $\sim 5\%$ of its input as visible light; an LED uses about $10$ W to produce the same light as a $60$ W incandescent, which corresponds to roughly $30\%$. Reasoning: both convert electrical input into light and heat; LEDs use semiconductor electroluminescence, which loses much less energy as heat, while incandescents rely on a heated filament where most energy becomes heat. Therefore, for the same useful light, LEDs use far less electrical input, which is the definition of higher efficiency.
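For the caffeine data in the second answer above, the pattern can be read off by looking at the change from each dose to the next. The sketch uses the values given in the question; the variable names are illustrative.

```python
dose_mg = [0, 50, 100, 150, 200]
reaction_ms = [280, 260, 250, 245, 260]

# Change in reaction time from each dose to the next (negative = faster).
changes_ms = [b - a for a, b in zip(reaction_ms, reaction_ms[1:])]

# Dose with the shortest reaction time in this data set.
fastest_ms = min(reaction_ms)
fastest_dose_mg = dose_mg[reaction_ms.index(fastest_ms)]

print(changes_ms)       # [-20, -10, -5, 15]
print(fastest_dose_mg)  # 150
```

The gains shrink with each step (20, 10, then 5 ms) before reversing, which is what "up to a point" means. With one value per dose and no spread given, the reversal at 200 mg is suggestive and would need repeat trials to confirm.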

Reasoning

Challenge

In a double-blind trial, neither side can consciously or unconsciously influence outcomes. If doctors knew, they might treat the drug group differently (more attentive care, interpret symptoms differently); if patients knew, the placebo effect and reporting would differ. Double-blinding removes both channels of bias.
Study B deserves more weight: much larger $n$ (better reliability), researcher-measured weight (more accurate and valid than self-report), longer duration (captures real effects). Study A’s small sample and self-reported DV make both reliability and validity weaker.
“Proof” in everyday use means certainty. In science, no finite set of observations can establish certainty — there may always be a future experiment that contradicts a theory. Science “supports” hypotheses tentatively and is open to revision. Paradoxically, this is a strength: self-correction is why science advances, while dogma that claims proof cannot be improved.
Claim: higher hours of study are associated with higher test scores. Evidence: positive trend on the graph. Reasoning: more time on content could plausibly improve retention and skill. But correlation is not causation; confounding variables (e.g. motivation, sleep, subject aptitude, prior knowledge) might cause both more study and better scores. A controlled experiment or statistical control is needed to distinguish the effect of study time itself.
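The sample-size point in the Study A versus Study B answer above can be given a rough number. For a mean, the uncertainty due to random variation shrinks roughly in proportion to $1/\sqrt{n}$. The sketch assumes, purely for illustration, that individual results vary by the same amount in both studies, which the question does not state.

```python
import math

n_a = 15   # Study A sample size (from the question)
n_b = 500  # Study B sample size (from the question)

# Ratio of the two uncertainties, assuming equal spread of individual results.
uncertainty_ratio = math.sqrt(n_b / n_a)

print(f"Study A's mean is about {uncertainty_ratio:.1f} times less certain than Study B's")
```

On that assumption, Study B pins down its mean roughly 5.8 times more tightly. Sample size does nothing for validity, which is why the measurement method (self-report versus weighed by researchers) is argued separately.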
