Data analysis and distributions

(a) positively skewed, (b) bimodal, (c) symmetric, (d) negatively skewed.
Mean $= \dfrac{85}{8} = 10.625$ . Median $= \dfrac{7+7}{2} = 7$ . Range $= 40 - 3 = 37$ .
Without $40$ : mean $= \dfrac{45}{7} \approx 6.43$ , median $= 7$ , range $= 9 - 3 = 6$ . The range changed most (from $37$ to $6$ ), followed by the mean (from $10.625$ to $6.43$ ). The median barely changed.
The median ( $7$ ) better represents the typical value because the outlier ( $40$ ) inflates the mean.
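The figures above can be checked numerically. The original data set is not reproduced in these answers, so the list below is a hypothetical one, chosen only because it matches the stated sum ($85$), count ($8$), minimum ($3$), maximum ($40$), and median ($7$). The helper name `summary` is also just illustrative.

```python
from statistics import mean, median

# Hypothetical data: chosen only to match the stated statistics
# (sum 85, n = 8, min 3, max 40, median 7); not the textbook's actual values.
data = [3, 5, 6, 7, 7, 8, 9, 40]

def summary(values):
    """Return (mean, median, range) for a list of numbers."""
    return mean(values), median(values), max(values) - min(values)

with_outlier = summary(data)
without_outlier = summary([x for x in data if x != 40])

print("with 40:   ", with_outlier)
print("without 40:", without_outlier)
```

Removing the single value $40$ leaves the median where it was but moves the mean and, most of all, the range.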
Systematic sampling.
People at a train station are more likely to prefer trains, so train travel would be overrepresented. People who drive, cycle, or walk are less likely to be at the station.
Stem | Leaf: $1$ | $4\;8$ , $2$ | $2\;5\;7$ , $3$ | $1\;3\;6\;8$ , $4$ | $2$ .
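A display like this can be built by splitting each value into a tens digit (stem) and a units digit (leaf). The sketch below assumes two-digit whole numbers; the data list is read off the plot above, and `stem_and_leaf` is an illustrative helper name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem), with units digits (leaves) in order."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Values read off the stem-and-leaf plot above.
data = [14, 18, 22, 25, 27, 31, 33, 36, 38, 42]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```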
A back-to-back stem-and-leaf plot or side-by-side box plots would both work well for comparing two numerical distributions.
The mean is higher than the median in a positively skewed distribution (the tail of high values pulls the mean up).
The bars rise then fall with a single peak, so the distribution is approximately symmetric (or very slightly positively skewed if the tail on the right is longer).

Back-to-back stem-and-leaf: Stem $0$ : Class A leaves $8\;7\;6\;5\;5\;4\;3\;3\;2$ | Class B leaves $1\;2\;4\;5\;5\;6\;6\;7\;7\;8$ . Stem $1$ : Class A leaf $2$ | Class B (none). Class A has a wider spread (range $10$ vs $7$ ) with an outlier at $12$ . Class B is more tightly clustered. Medians are similar (A: $5$ , B: $5.5$ ).
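The range and median comparison can be reproduced from the leaves listed above. This is a minimal check; `spread_and_centre` is an illustrative helper name.

```python
from statistics import median

# Values read off the back-to-back plot above (stem = tens, leaf = units).
class_a = [2, 3, 3, 4, 5, 5, 6, 7, 8, 12]
class_b = [1, 2, 4, 5, 5, 6, 6, 7, 7, 8]

def spread_and_centre(values):
    """Return (range, median) for a list of numbers."""
    return max(values) - min(values), median(values)

result_a = spread_and_centre(class_a)
result_b = spread_and_centre(class_b)

print("Class A (range, median):", result_a)
print("Class B (range, median):", result_b)
```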
Positively skewed. The mean ( $24$ ) is greater than the median ( $18$ ), which indicates a tail of high values pulling the mean up.
Total $= 2000$ . Proportions: Year 7 $= \dfrac{800}{2000} \times 200 = 80$ , Year 8 $= \dfrac{700}{2000} \times 200 = 70$ , Year 9 $= \dfrac{500}{2000} \times 200 = 50$ .
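The proportional allocation above follows one rule: stratum size divided by the total, times the sample size. A small sketch of that rule is below; the function name `stratified_allocation` is illustrative.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Split a sample across strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    return {name: size * sample_size / total for name, size in strata_sizes.items()}

enrolments = {"Year 7": 800, "Year 8": 700, "Year 9": 500}
allocation = stratified_allocation(enrolments, 200)
print(allocation)
```

Here every share is a whole number. In general the shares may need rounding so that they still add up to the sample size.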
House prices are typically positively skewed (a few very expensive houses push the mean up). The median gives a better sense of what a “typical” house costs because it is not affected by the extreme values.
Morning: mean $= 46$ , median $= 46$ , range $= 8$ . Afternoon: mean $\approx 48.9$ , median $= 48$ , range $= 15$ . The afternoon shift is slightly slower on average and has more variation, possibly due to the outlier at $58$ .
Example: heights of a mixed group of adult men and women. The two peaks correspond to the average female height and the average male height — two overlapping subpopulations create bimodality.
A sample of $10$ friends is a convenience sample that is not random. Friends tend to share interests, backgrounds, and demographics, so the sample is likely biased and not representative of the whole school. A random or stratified sample would be more reliable.
When comparing two distributions, comment on: (i) centre (mean or median), (ii) spread (range or IQR), and (iii) shape (symmetric, skewed, or bimodal). Also note any outliers.

The CEO’s salary of \$800,000 pulls the mean up. If the other $19$ earn an average of \$60,000, the total is $19 \times 60\,000 + 800\,000 = 1\,940\,000$ , giving mean $= \dfrac{1\,940\,000}{20} = 97\,000$ dollars. The “average” (mean) is close to \$95,000 but the median is around \$60,000. Most employees earn far less than the reported average. The company uses the mean to create a misleading impression.
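The gap between mean and median can be seen with a short check. The answer gives only the average of the other 19 salaries, so the list below assumes, purely for illustration, that each of them earns exactly 60,000 dollars.

```python
from statistics import mean, median

# Assumption for illustration: each of the 19 other employees earns exactly
# 60,000 (the answer gives only their average, not the individual salaries).
salaries = [60_000] * 19 + [800_000]

mean_salary = mean(salaries)      # pulled up by the single large salary
median_salary = median(salaries)  # middle of the sorted list, unaffected by it

print(mean_salary, median_salary)
```

Under this assumption, 19 of the 20 employees earn less than the mean.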
Question: “Do Year 9 students spend more time per week on homework than Year 7 students?” Sampling: stratified random sample of $30$ students from each year level. Variables: year level (categorical), homework hours per week (continuous). Display: side-by-side box plots or back-to-back stem-and-leaf plot. Calculate mean and median for each group and compare.
Example: Histogram 1 is symmetric (bell-shaped). Histogram 2 is bimodal with one peak below the mean and one above. Both can have the same mean (balanced around the centre) and the same range (same min and max) but very different shapes. The bimodal histogram has more data at the extremes and less near the centre.
Original sum $= 20 \times 30 = 600$ . New sum $= 600 + 80 = 680$ . New mean $= \dfrac{680}{21} \approx 32.4$ . The mean increased by about $2.4$ . The median changes from the average of the 10th and 11th values to the 11th value, so it increases by only half the gap between those two values (perhaps $0$ or $1$ ). This makes the median more stable and more representative than the mean.
A population is the entire group of interest; a sample is a subset selected for study. Example: surveying every one of Australia’s $\approx 26$ million residents about exercise habits is impractical due to cost and time. A representative sample of a few thousand provides useful estimates instead.

Combined sum $= n\bar{x}_A + n\bar{x}_B$ . Combined count $= 2n$ . Combined mean $= \dfrac{n\bar{x}_A + n\bar{x}_B}{2n} = \dfrac{\bar{x}_A + \bar{x}_B}{2}$ . For different sizes: combined mean $= \dfrac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}$ , which is a weighted average of the two means.
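The weighted-average formula can be tried on numbers. The group sizes and means below are hypothetical, chosen only to show the equal-size and unequal-size cases; `combined_mean` is an illustrative helper name.

```python
def combined_mean(n_a, mean_a, n_b, mean_b):
    """Mean of two pooled groups: a size-weighted average of the group means."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)

# Hypothetical groups with means 60 and 80.
equal_sizes = combined_mean(10, 60, 10, 80)    # reduces to (60 + 80) / 2
unequal_sizes = combined_mean(30, 60, 10, 80)  # larger group pulls the result toward 60

print(equal_sizes, unequal_sizes)
```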
(a) Mean increases by $c$ (every value increases by $c$ , so the sum increases by $nc$ , and the mean by $c$ ). (b) Median increases by $c$ (the middle value shifts by $c$ ). (c) Range is unchanged (max and min both increase by $c$ , so their difference is the same). (d) Standard deviation is unchanged (deviations from the mean are the same, since each value and the mean both shift by $c$ ).
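These four results can be checked on a small example. The data values and the constant $c = 10$ below are hypothetical, and the population standard deviation (`statistics.pstdev`) is used as one common definition; the same conclusion holds for the sample version.

```python
from statistics import mean, median, pstdev

def summarise(values):
    """Return (mean, median, range, population standard deviation)."""
    return mean(values), median(values), max(values) - min(values), pstdev(values)

data = [2, 4, 4, 5, 7, 8]        # hypothetical values
c = 10                           # hypothetical constant added to every value
shifted = [x + c for x in data]

before = summarise(data)
after = summarise(shifted)

print("before:", before)
print("after: ", after)
```

The mean and median move by $c$ , while the range and standard deviation stay the same.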
One possible set: $30, 35, 38, 40, 44, 46, 50, 55, 62, 100$ . Sum $= 500$ , mean $= 50$ . Median $= \dfrac{44+46}{2} = 45$ . The high value $100$ creates a right tail, giving positive skew. Mean $>$ median, confirming positive skewness.
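The proposed set can be verified directly:

```python
from statistics import mean, median

# The set proposed in the answer above.
values = [30, 35, 38, 40, 44, 46, 50, 55, 62, 100]

total = sum(values)
set_mean = mean(values)
set_median = median(values)

print(len(values), total, set_mean, set_median)
```

The mean exceeds the median, which is consistent with a right tail.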
Proportions: Year 7 $= \dfrac{350}{1200} \times 120 = 35$ , Year 8 $= \dfrac{320}{1200} \times 120 = 32$ , Year 9 $= \dfrac{280}{1200} \times 120 = 28$ , Year 10 $= \dfrac{250}{1200} \times 120 = 25$ . The student who scored $0$ was absent, not genuinely scoring zero. This value should be treated as missing data and excluded from analysis (or the student should be resurveyed). Including it would unfairly lower Year 10’s statistics and misrepresent that year level’s performance.

Tier 1

Tier 2

Tier 3

Challenge