Measures of central tendency
- Mean (average) is the sum of all values divided by the number of values.
- Median is the middle value of a dataset after sorting; with an even number of values, it is the average of the two middle values.
- Mode is the most frequent value in a dataset.
Data: Palmer penguins
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Dream | 36.0 | 17.9 | 190 | 3450 | female | 2007 |
| Chinstrap | Dream | 50.9 | 17.9 | 196 | 3675 | female | 2009 |
| Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female | 2007 |
| Adelie | Torgersen | 45.8 | 18.9 | 197 | 4150 | male | 2008 |
| Gentoo | Biscoe | 48.6 | 16.0 | 230 | 5800 | male | 2008 |
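As a small illustration of the median definition, the sketch below sorts the five body masses from the sample rows and picks the middle value. The helper name `median` and the variable names are illustrative choices, not part of the original slides; Python's standard `statistics.median` is shown alongside for comparison.

```python
import statistics

# Body masses (g) of the five sample penguins shown in the table.
body_mass = [3450, 3675, 4500, 4150, 5800]


def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    s = sorted(values)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


manual_median = median(body_mass)              # sorted: 3450, 3675, 4150, 4500, 5800
library_median = statistics.median(body_mass)  # standard-library equivalent
print(manual_median, library_median)
```

Both calls should agree on 4150 g, the third of the five sorted values.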
Arithmetic mean
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
Body mass = 3450, 3675, 4500, 4150, 5800 g
\(\overline{\text{body mass}} = \frac{3450 + 3675 + 4500 + 4150 + 5800}{5} =\) 4315
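The same calculation can be written as a minimal Python sketch that applies the formula directly; the variable names are illustrative.

```python
# Body masses (g) of the five sample penguins.
body_mass = [3450, 3675, 4500, 4150, 5800]

# Arithmetic mean: sum of all values divided by the number of values.
mean_mass = sum(body_mass) / len(body_mass)
print(mean_mass)  # 4315.0
```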
Other means
- Geometric mean: The \(n\)-th root of the product of \(n\) values.
\[
\bar{x} = \sqrt[n]{x_1 \cdot x_2 \cdot \ldots \cdot x_n}
\]
\(\overline{\text{body mass}} = \sqrt[5]{3450 \cdot 3675 \cdot 4500 \cdot 4150 \cdot 5800} =\) 4241.83
- Quadratic mean (root mean square): The square root of the average of the squares of the values.
\[
\bar{x} = \sqrt{\frac{x_1^2 + x_2^2 + \ldots + x_n^2}{n}}
\]
\(\overline{\text{body mass}} = \sqrt{\frac{3450^2 + 3675^2 + 4500^2 + 4150^2 + 5800^2}{5}} =\) 4393.65
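A short sketch of both formulas in Python, assuming Python 3.8 or later for `math.prod` and `statistics.geometric_mean`. The variable names are illustrative.

```python
import math
import statistics

# Body masses (g) of the five sample penguins.
body_mass = [3450, 3675, 4500, 4150, 5800]
n = len(body_mass)

# Geometric mean: n-th root of the product of the n values.
geometric = math.prod(body_mass) ** (1 / n)
# The standard library computes the same quantity.
geometric_lib = statistics.geometric_mean(body_mass)

# Quadratic mean (root mean square): square root of the mean of the squares.
quadratic = math.sqrt(sum(x ** 2 for x in body_mass) / n)

print(round(geometric, 2), round(quadratic, 2))
```

For positive data that are not all equal, the geometric mean is below the arithmetic mean and the quadratic mean is above it, which matches the three values on these slides (about 4241.83, 4315, and 4393.65).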
Other means (cont.)
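- Trimmed mean: The mean of the dataset after removing a fixed share of the smallest and largest values. With the values sorted as \(x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}\) and \(p\) values dropped from each end:
\[
\bar{x} = \frac{1}{n - 2p} \sum_{i=p+1}^{n-p} x_{(i)}
\]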
\(\overline{\text{body mass}} = \frac{\color{Red}{3450 +} 3675 + 4500 + 4150 \color{Red}{ + 5800}}{3} = \frac{3675 + 4500 + 4150}{3} =\) 4108.33
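The trimming step can be sketched in Python as follows. The function `trimmed_mean` is a hypothetical helper written for this example, and it assumes `p` is the number of values dropped from each end (here 1), as in the worked calculation.

```python
def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values (p is a count per tail)."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= p < n / 2")
    # Sort first so that slicing removes the extremes, not arbitrary positions.
    kept = sorted(values)[p:len(values) - p]
    return sum(kept) / len(kept)


# Body masses (g) of the five sample penguins.
body_mass = [3450, 3675, 4500, 4150, 5800]

trimmed = trimmed_mean(body_mass, p=1)  # drops 3450 and 5800
print(round(trimmed, 2))
```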
Mode
Mode is the most frequent value in a dataset.
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Dream | 36.0 | 17.9 | 190 | 3450 | female | 2007 |
| Chinstrap | Dream | 50.9 | 17.9 | 196 | 3675 | female | 2009 |
| Gentoo | Biscoe | 46.1 | 13.2 | 211 | 4500 | female | 2007 |
| Adelie | Torgersen | 45.8 | 18.9 | 197 | 4150 | male | 2008 |
| Gentoo | Biscoe | 48.6 | 16.0 | 230 | 5800 | male | 2008 |
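Let's find the mode of the species variable in the full penguins dataset.

| species | n |
|---|---|
| Adelie | 152 |
| Gentoo | 124 |
| Chinstrap | 68 |

Adelie is the most frequent species (152 of 344 penguins), so it is the mode.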
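Given the species counts above, finding the mode amounts to picking the category with the largest count. A minimal Python sketch, with the counts copied from the table rather than computed from the raw dataset:

```python
# Species counts copied from the table above (not recomputed from raw data).
species_counts = {"Adelie": 152, "Gentoo": 124, "Chinstrap": 68}

# The mode is the category with the highest count.
mode_species = max(species_counts, key=species_counts.get)
total = sum(species_counts.values())

print(mode_species, species_counts[mode_species], total)
```

Unlike the mean and median, the mode is defined for categorical variables such as species, which is why it is the measure used here.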
Measures of dispersion
- Range: Difference between the maximum and minimum values.
- Variance: Average of squared differences from the mean.
- Standard deviation: Square root of the variance.
- Interquartile range (IQR): Difference between the 75th and 25th percentiles.
Range
Body Mass: 3450, 3675, 4500, 4150, 5800
\[
\text{Range} = \text{Max} - \text{Min} = 5800 - 3450 = 2350
\]
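The range needs only the two extreme values, as this short Python sketch shows (variable names are illustrative):

```python
# Body masses (g) of the five sample penguins.
body_mass = [3450, 3675, 4500, 4150, 5800]

# Range: maximum minus minimum.
mass_range = max(body_mass) - min(body_mass)
print(mass_range)  # 2350
```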
Variance
For a population:
\[
\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
For a sample:
\[
\text{Variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
Body Mass: 3450, 3675, 4500, 4150, 5800
\[
\begin{aligned}
\text{Variance} &= \frac{(3450 - 4315)^2 + (3675 - 4315)^2 + (4500 - 4315)^2 + (4150 - 4315)^2 + (5800 - 4315)^2}{5 - 1} \\
&= \frac{3424500}{4} = 856125
\end{aligned}
\]
🤨
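To make the difference between the two variance formulas concrete, the sketch below computes both from the same sum of squared deviations and compares them with the standard-library functions `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n - 1). Variable names are illustrative.

```python
import statistics

# Body masses (g) of the five sample penguins.
body_mass = [3450, 3675, 4500, 4150, 5800]
n = len(body_mass)
mean_mass = sum(body_mass) / n

# Sum of squared differences from the mean.
sum_sq = sum((x - mean_mass) ** 2 for x in body_mass)

population_variance = sum_sq / n    # divide by n
sample_variance = sum_sq / (n - 1)  # divide by n - 1, as in the worked example

print(sum_sq, population_variance, sample_variance)
print(statistics.pvariance(body_mass), statistics.variance(body_mass))
```

Because the deviations are squared, the result is in squared units (g²) rather than grams.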
Standard deviation
\[
\text{Standard deviation} = \sqrt{\text{Variance}}
\]
Body Mass: 3450, 3675, 4500, 4150, 5800
\[
\text{Standard deviation} = \sqrt{856125} = 925.27
\]
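Interpretation: The body mass of a penguin in the sample typically differs from the mean by roughly 925 g. (Strictly, the standard deviation is the root mean square deviation, not the average absolute deviation.)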
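A brief Python sketch of the standard deviation, using the sample (n - 1) version to match the variance computed on the previous slide. It also prints the mean absolute deviation, an extra quantity not on the slides, to show that the two measures of "typical distance from the mean" are not the same number.

```python
import math
import statistics

# Body masses (g) of the five sample penguins.
body_mass = [3450, 3675, 4500, 4150, 5800]
mean_mass = sum(body_mass) / len(body_mass)

# Sample standard deviation: square root of the sample variance.
sample_sd = math.sqrt(statistics.variance(body_mass))
sample_sd_lib = statistics.stdev(body_mass)  # same quantity in one call

# Mean absolute deviation, shown only for comparison.
mean_abs_dev = sum(abs(x - mean_mass) for x in body_mass) / len(body_mass)

print(round(sample_sd, 2), round(sample_sd_lib, 2), mean_abs_dev)
```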
Quartiles
Quartiles are values that divide a dataset into four equal parts.
- Q1: 25th percentile.
- Q2: 50th percentile (median).
- Q3: 75th percentile.
Body Mass (sorted): \(3450, 3575, \underset{\color{#e64173}{Q_1}}{\color{#e64173}{3675}}, \underset{\color{#e64173}{Q_2} = \text{Median}}{\underbrace{3900, 3915}}, 4150, \underset{\color{#e64173}{Q_3}}{\color{#e64173}{4500}}, 5800.\)
Interpretation:
- \(Q_1 = 3675\): 25% of penguins have a body mass below 3675 g.
- \(Q_3 = 4500\): 75% of penguins have a body mass below 4500 g.
Interquartile range
- Interquartile range (IQR) is the difference between the third and first quartiles, \(Q_3 - Q_1\).
Body Mass (sorted): 3450, 3575, 3675, 3900, 3915, 4150, 4500, 5800.
\[
\begin{aligned}
Q_1 &= 3675 \\
Q_3 &= 4500 \\
\text{IQR} &= Q_3 - Q_1 = 4500 - 3675 = 825
\end{aligned}
\]
Interpretation: The middle 50% of penguins have a body mass between 3675 and 4500 g, a spread of 825 g.
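The sketch below reproduces the quartiles used on these slides under one assumed convention: a quartile is the smallest sorted value that has at least the given fraction of the data strictly below it. The helper `value_with_fraction_below` is hypothetical and written only for this example, and it assumes distinct values. Software usually interpolates between neighbouring values instead, so the standard-library `statistics.quantiles` (default "exclusive" method) is shown for comparison and gives somewhat different numbers.

```python
import math
import statistics

# The eight body masses (g) from the quartile example, in sorted order.
body_mass = [3450, 3575, 3675, 3900, 3915, 4150, 4500, 5800]


def value_with_fraction_below(values, p):
    """Hypothetical helper: smallest sorted value with at least a fraction p
    of the data strictly below it (assumes distinct values)."""
    s = sorted(values)
    idx = min(math.ceil(p * len(s)), len(s) - 1)
    return s[idx]


q1 = value_with_fraction_below(body_mass, 0.25)  # 2 of 8 values lie below it
q3 = value_with_fraction_below(body_mass, 0.75)  # 6 of 8 values lie below it
iqr = q3 - q1
print(q1, q3, iqr)

# Interpolating convention from the standard library, for comparison.
q1_lib, q2_lib, q3_lib = statistics.quantiles(body_mass, n=4)
print(q1_lib, q2_lib, q3_lib, q3_lib - q1_lib)
```

With only eight observations the choice of convention matters noticeably; on larger datasets the different quartile definitions tend to give much closer results.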
Central tendency vs. dispersion
- Central tendency measures describe the center of a dataset.
- Dispersion measures describe the spread of a dataset.