What is statistics?

Probability and Statistics

Ihor Miroshnychenko

Kyiv School of Economics

Types of statistics

Bayesian vs. Frequentist statistics

Frequentist statistics

  • Based on the frequency of events.

“If you repeat an experiment many times, the frequency with which an event occurs will converge to the true probability of the event.”

If I flip a fair coin more and more times, the relative frequency of heads converges to 0.5.
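The frequentist idea can be illustrated with a small simulation. This is only a sketch: the function name `heads_frequency`, the fixed seed, and the chosen numbers of flips are assumptions made for the example.

```python
import random


def heads_frequency(n_flips, seed=42):
    """Relative frequency of heads in n_flips simulated fair-coin flips."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


# The relative frequency should drift toward 0.5 as the number of flips grows.
for n in (10, 100, 10_000, 100_000):
    print(n, heads_frequency(n))
```

With few flips the frequency can be far from 0.5; with many flips it usually settles close to it.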

Bayesian statistics

  • Based on prior knowledge and probability.

“The probability of an event is the degree of belief that the event will occur.”


I believe that the probability of heads is 0.5.
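One common way to make a "degree of belief" concrete is a Beta prior that is updated with observed flips. The sketch below is an illustration, not part of the original slides: the prior Beta(10, 10), the observed 7 heads and 3 tails, and the helper names are all assumptions.

```python
def update_beta(prior_heads, prior_tails, heads, tails):
    """Posterior Beta parameters after observing new coin flips."""
    return prior_heads + heads, prior_tails + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution, read as the belief in heads."""
    return a / (a + b)


prior = (10, 10)  # assumed prior: belief centred on 0.5
posterior = update_beta(*prior, heads=7, tails=3)  # hypothetical observations

print(beta_mean(*prior))      # belief before seeing data
print(beta_mean(*posterior))  # belief after seeing data
```

The belief moves from 0.5 toward the observed share of heads, but only part of the way, because the prior still carries weight.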

Population vs. sample

Population

👦👦🏻👦🏼👦🏽👦🏾👦🏿👧👧🏻👧🏼👧🏽👧🏾👧🏿👦👦🏻👦🏼👦🏽👦🏾👦🏿👧👧🏻👧🏼👧🏽👧🏾👧🏿👦👦🏻👦🏼👦🏽👦🏾👦🏿👧👧🏻👧🏼👧🏽👧🏾👧🏿

Sampling →



Inferential statistics ←

Sample

👦👦🏻 👦🏽 👦🏿👧👧🏻👧🏼 👧🏾 👦👦🏻 👦🏽👦🏾 👧👧🏻👧🏼 👧🏾 👦 👦🏼👦🏽 👦🏿 👧🏻👧🏼 👧🏾👧🏿
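The relationship between a population and a sample can be sketched in a few lines. The population here is simulated (10,000 hypothetical body masses drawn from a normal distribution with assumed mean 4200 g and standard deviation 800 g); none of these numbers come from the slides.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (an assumption) for reproducibility

# Hypothetical population: 10,000 simulated body masses in grams.
population = [rng.gauss(4200, 800) for _ in range(10_000)]

# Sampling: draw 100 individuals without replacement.
sample = rng.sample(population, 100)

# Inferential statistics: use the sample mean as an estimate of the population mean.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(round(population_mean), round(sample_mean))
```

The sample mean is usually close to, but not exactly equal to, the population mean.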

Descriptive statistics


  • Descriptive statistics summarize and describe the main features of a dataset.

The various sub-areas of descriptive statistics can be summarized as follows:

  1. Measures of frequency: Count, percent, frequency.
  2. Measures of central tendency: Mean, median, mode.
  3. Measures of dispersion: Range, variance, standard deviation.
  4. Measures of position: Percentile ranks, quartile ranks.

Measures of central tendency


  • Mean (average) is the sum of all values divided by the number of values.
  • Median is the middle value of a sorted dataset.
  • Mode is the most frequent value in a dataset.

Data: Palmer penguins

Palmer penguins dataset
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Dream 36.0 17.9 190 3450 female 2007
Chinstrap Dream 50.9 17.9 196 3675 female 2009
Gentoo Biscoe 46.1 13.2 211 4500 female 2007
Adelie Torgersen 45.8 18.9 197 4150 male 2008
Gentoo Biscoe 48.6 16.0 230 5800 male 2008

Arithmetic mean

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Body mass = 3450, 3675, 4500, 4150, 5800 g


\(\overline{\text{body mass}} = \frac{3450 + 3675 + 4500 + 4150 + 5800}{5} =\) 4315
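As a quick check, the arithmetic mean of the five body masses can be computed both directly from the formula and with Python's standard `statistics` module. The variable names are chosen for this sketch only.

```python
import statistics

body_mass = [3450, 3675, 4500, 4150, 5800]  # grams, from the slide

# Direct translation of the formula: sum of the values divided by their count.
mean_manual = sum(body_mass) / len(body_mass)

# The same quantity via the standard library.
mean_stdlib = statistics.mean(body_mass)

print(mean_manual, mean_stdlib)
```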

Other means

  • Geometric mean: The \(n\)-th root of the product of \(n\) values.

\[ \bar{x} = \sqrt[n]{x_1 \cdot x_2 \cdot \ldots \cdot x_n} \]

\(\overline{\text{body mass}} = \sqrt[5]{3450 \cdot 3675 \cdot 4500 \cdot 4150 \cdot 5800} =\) 4241.83

  • Quadratic mean (root mean square): The square root of the average of the squares of the values.

\[ \bar{x} = \sqrt{\frac{x_1^2 + x_2^2 + \ldots + x_n^2}{n}} \]

\(\overline{\text{body mass}} = \sqrt{\frac{3450^2 + 3675^2 + 4500^2 + 4150^2 + 5800^2}{5}} =\) 4393.65
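The geometric and quadratic means above can be reproduced with a short sketch. `statistics.geometric_mean` needs Python 3.8 or newer; the quadratic mean is written out by hand because the standard library has no dedicated function for it.

```python
import math
import statistics

body_mass = [3450, 3675, 4500, 4150, 5800]  # grams, from the slide

# Geometric mean: n-th root of the product of the values.
geometric = statistics.geometric_mean(body_mass)

# Quadratic mean (root mean square): square root of the mean of the squares.
quadratic = math.sqrt(sum(x ** 2 for x in body_mass) / len(body_mass))

print(round(geometric, 2), round(quadratic, 2))
```

For positive values that are not all equal, the geometric mean lies below the arithmetic mean (4315 here) and the quadratic mean lies above it.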

Other means (cont.)

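  • Trimmed mean: The mean of the dataset after removing the \(p\) smallest and \(p\) largest values (\(p\) is often chosen as a percentage of \(n\)).

\[ \bar{x} = \frac{1}{n - 2p} \sum_{i=p+1}^{n-p} x_{(i)} \]

where \(x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}\) are the values sorted in ascending order.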

\(\overline{\text{body mass}} = \frac{\color{Red}{3450 +} 3675 + 4500 + 4150 \color{Red}{ + 5800}}{3} = \frac{3675 + 4500 + 4150}{3} =\) 4108.33
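A minimal sketch of the trimmed mean, assuming the trimming is specified as a count `p` of values dropped from each end (the function name `trimmed_mean` is hypothetical):

```python
def trimmed_mean(values, p):
    """Mean after dropping the p smallest and the p largest values."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must leave at least one observation")
    ordered = sorted(values)              # trimming only makes sense on sorted data
    kept = ordered[p:len(ordered) - p]    # drop p values from each end
    return sum(kept) / len(kept)


body_mass = [3450, 3675, 4500, 4150, 5800]  # grams, from the slide
print(round(trimmed_mean(body_mass, 1), 2))
```

With `p = 0` nothing is removed and the result equals the ordinary arithmetic mean.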

Median

  • Odd number of values: Middle value.
  • Even number of values: Average of two middle values.

Body Mass (odd): 3450, 3675, 4500, 4150, 5800

  1. Arrange in ascending order: 3450, 3675, 4150, 4500, 5800
  2. Median: 4150

Body Mass (even): 3450, 3675, 4500, 4150, 5800, 3900

  1. Arrange in ascending order: 3450, 3675, 3900, 4150, 4500, 5800
  2. Median: \(\frac{3900 + 4150}{2} = 4025\)

The median is robust to outliers!
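The two median cases, and the robustness to outliers, can be checked with the standard library. The value 58000 is a made-up recording error used only to show the effect on the mean and the median.

```python
import statistics

odd = [3450, 3675, 4500, 4150, 5800]   # five values, from the slide
even = odd + [3900]                    # six values, from the slide

median_odd = statistics.median(odd)    # middle value after sorting
median_even = statistics.median(even)  # average of the two middle values

# Hypothetical outlier: 5800 mistyped as 58000.
with_outlier = [3450, 3675, 4500, 4150, 58000]
median_outlier = statistics.median(with_outlier)
mean_outlier = statistics.mean(with_outlier)

print(median_odd, median_even, median_outlier, mean_outlier)
```

The outlier leaves the median unchanged while the mean more than triples.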

Mode

Mode is the most frequent value in a dataset.

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Dream 36.0 17.9 190 3450 female 2007
Chinstrap Dream 50.9 17.9 196 3675 female 2009
Gentoo Biscoe 46.1 13.2 211 4500 female 2007
Adelie Torgersen 45.8 18.9 197 4150 male 2008
Gentoo Biscoe 48.6 16.0 230 5800 male 2008

Let's find the mode of the species variable in the full penguins dataset.

Penguin species
species n
Adelie 152
Gentoo 124
Chinstrap 68
  • Mode = Adelie (the most frequent species, with 152 observations)
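A short sketch of finding the mode of a categorical variable. The species column is rebuilt from the counts in the table above rather than loaded from the dataset, which is an assumption made to keep the example self-contained.

```python
import statistics
from collections import Counter

# Species column rebuilt from the counts shown in the table.
species = ["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68

counts = Counter(species)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)
print(statistics.mode(species))  # same answer from the statistics module
```

The mode is the category itself; the count only tells us how often it occurs.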

Measures of dispersion


  • Range: Difference between the maximum and minimum values.
  • Variance: Average of squared differences from the mean.
  • Standard deviation: Square root of the variance.
  • Interquartile range (IQR): Difference between the 75th and 25th percentiles.

Range

Body Mass: 3450, 3675, 4500, 4150, 5800

\[ \text{Range} = \text{Max} - \text{Min} = 5800 - 3450 = 2350 \]

Variance

For population: \[ \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

For sample: \[ \text{Variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]


Body Mass: 3450, 3675, 4500, 4150, 5800

\[ \begin{aligned} \text{Variance} &= \frac{(3450 - 4315)^2 + (3675 - 4315)^2 + (4500 - 4315)^2 + (4150 - 4315)^2 + (5800 - 4315)^2}{5 - 1} \\ &= \frac{3424500}{4} = 856125 \end{aligned} \]

🤨

Standard deviation

\[ \text{Standard deviation} = \sqrt{\text{Variance}} \]

Body Mass: 3450, 3675, 4500, 4150, 5800

\[ \text{Standard deviation} = \sqrt{856125} = 925.27 \]

Interpretation: Roughly speaking, the body mass of a penguin in the sample typically differs from the mean by about 925.27 g.
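The range, variance, and standard deviation of the five body masses can be verified with the standard `statistics` module. The sketch also prints the population versions (dividing by n instead of n - 1) for comparison.

```python
import statistics

body_mass = [3450, 3675, 4500, 4150, 5800]  # grams, from the slide

value_range = max(body_mass) - min(body_mass)

sample_variance = statistics.variance(body_mass)       # divides by n - 1
population_variance = statistics.pvariance(body_mass)  # divides by n

sample_sd = statistics.stdev(body_mass)        # square root of the sample variance
population_sd = statistics.pstdev(body_mass)   # square root of the population variance

print(value_range, sample_variance, round(sample_sd, 2))
print(population_variance, round(population_sd, 2))
```

The variance is expressed in squared grams, which is why the standard deviation (in grams) is easier to interpret.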

Quartiles

Quartiles are values that divide a dataset into four equal parts.

  • Q1: 25th percentile.
  • Q2: 50th percentile (median).
  • Q3: 75th percentile.

Body Mass (sorted): \(3450, 3575, \underset{\color{#e64173}{Q_1}}{\color{#e64173}{3675}}, \underset{\color{#e64173}{Q_2} = \text{Median}}{\underbrace{3900, 3915}}, 4150, \underset{\color{#e64173}{Q_3}}{\color{#e64173}{4500}}, 5800.\)

Interpretation:

  • \(Q_1 = 3675\): 25% of penguins have a body mass below 3675 g.
  • \(Q_3 = 4500\): 75% of penguins have a body mass below 4500 g.

Interquartile range

  • Interquartile range (IQR) is the difference between the Q3 and Q1 quartiles.

Body Mass (sorted): 3450, 3575, 3675, 3900, 3915, 4150, 4500, 5800.

\[ \begin{aligned} Q_1 &= 3675 \\ Q_3 &= 4500 \\ \text{IQR} &= Q_3 - Q_1 = 4500 - 3675 = 825 \end{aligned} \]

Interpretation: The middle 50% of penguins have a body mass between 3675 and 4500 g, a spread of 825 g.
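The quartile interpretation can be checked by counting how many values fall below each quartile. The sketch also shows that software often interpolates between observations, so `statistics.quantiles` (Python 3.8+, here with `method="inclusive"`) returns slightly different quartiles than the slide's values; the helper `share_below` is a hypothetical name.

```python
import statistics

body_mass = [3450, 3575, 3675, 3900, 3915, 4500, 4150, 5800]  # grams
ordered = sorted(body_mass)  # quartiles are defined on sorted data


def share_below(values, threshold):
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


q1_slide, q3_slide = 3675, 4500          # quartiles as given on the slide
iqr_slide = q3_slide - q1_slide

print(share_below(ordered, q1_slide), share_below(ordered, q3_slide), iqr_slide)

# Interpolated quartiles from the standard library, for comparison.
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
print(q1, q2, q3, q3 - q1)
```

Different quartile conventions agree on large datasets but can differ noticeably on a sample of eight values.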

Central tendency vs. dispersion

  • Central tendency measures describe the center of a dataset.
  • Dispersion measures describe the spread of a dataset.

Tables

Absolute and relative frequencies

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Gentoo Biscoe 44.9 13.8 212 4750 female 2009
Adelie Biscoe 40.6 18.6 183 3550 male 2007
Gentoo Biscoe 45.4 14.6 211 4800 female 2007
Gentoo Biscoe 52.1 17.0 230 5550 male 2009
Chinstrap Dream 52.0 20.7 210 4800 male 2008
Gentoo Biscoe 44.9 13.3 213 5100 female 2008
Adelie Biscoe 38.6 17.2 199 3750 female 2009
Gentoo Biscoe 47.5 14.2 209 4600 female 2008
Adelie Dream 35.7 18.0 202 3550 female 2008
Gentoo Biscoe 40.9 13.7 214 4650 female 2007
Chinstrap Dream 51.5 18.7 187 3250 male 2009
Adelie Biscoe 40.5 17.9 187 3200 female 2007
Gentoo Biscoe 46.6 14.2 210 4850 female 2008
Adelie Dream 39.6 18.8 190 4600 male 2007
Gentoo Biscoe 45.5 13.7 214 4650 female 2007
Penguin species
Species Frequency Share
Adelie 152 0.4418605
Gentoo 124 0.3604651
Chinstrap 68 0.1976744
Total 344 1
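A frequency table like the one above can be sketched with `collections.Counter`. As an assumption for this example, the species column is rebuilt from the published counts instead of being read from the dataset.

```python
from collections import Counter

# Species column rebuilt from the counts in the table.
species = ["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68

counts = Counter(species)          # absolute frequencies
total = sum(counts.values())

# Relative frequency (share) = absolute frequency / total number of observations.
table = {name: (n, n / total) for name, n in counts.most_common()}

for name, (n, share) in table.items():
    print(f"{name:<10}{n:>5}{share:>9.4f}")
print(f"{'Total':<10}{total:>5}{1:>9}")
```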

Crosstab

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 35.5 17.5 190 3700 female 2008
Chinstrap Dream 46.4 18.6 190 3450 female 2007
Adelie Torgersen 45.8 18.9 197 4150 male 2008
Adelie Torgersen 42.1 19.1 195 4000 male 2008
Gentoo Biscoe 45.5 13.9 210 4200 female 2008
Adelie Dream 36.4 17.0 195 3325 female 2007
Gentoo Biscoe 51.1 16.5 225 5250 male 2009
Gentoo Biscoe 52.1 17.0 230 5550 male 2009
Chinstrap Dream 46.5 17.9 192 3500 female 2007
Adelie Biscoe 35.7 16.9 185 3150 female 2008
Gentoo Biscoe 49.8 16.8 230 5700 male 2008
Chinstrap Dream 46.0 18.9 195 4150 female 2007
Chinstrap Dream 50.3 20.0 197 3300 male 2007
Gentoo Biscoe 45.1 14.4 210 4400 female 2008
Adelie Biscoe 41.1 19.1 188 4100 male 2008

Species by island: count (row %); the Total column shows count (% of all penguins)
species      Biscoe          Dream           Torgersen      Total
Adelie       44 (28.95%)     56 (36.84%)     52 (34.21%)    152 (44.19%)
Chinstrap    0 (0%)          68 (100.00%)    0 (0%)         68 (19.77%)
Gentoo       124 (100.00%)   0 (0%)          0 (0%)         124 (36.05%)
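A crosstab counts how often each combination of two categorical variables occurs. The sketch below builds one from the 15 sample rows listed above, so its counts are much smaller than those in the full-dataset table; the helper `row_shares` is a hypothetical name.

```python
from collections import Counter

# (species, island) pairs copied from the 15 sample rows.
rows = [
    ("Adelie", "Torgersen"), ("Chinstrap", "Dream"), ("Adelie", "Torgersen"),
    ("Adelie", "Torgersen"), ("Gentoo", "Biscoe"), ("Adelie", "Dream"),
    ("Gentoo", "Biscoe"), ("Gentoo", "Biscoe"), ("Chinstrap", "Dream"),
    ("Adelie", "Biscoe"), ("Gentoo", "Biscoe"), ("Chinstrap", "Dream"),
    ("Chinstrap", "Dream"), ("Gentoo", "Biscoe"), ("Adelie", "Biscoe"),
]

crosstab = Counter(rows)  # missing combinations count as 0
species_levels = sorted({s for s, _ in rows})
island_levels = sorted({i for _, i in rows})


def row_shares(species):
    """Share of each island within one species (row percentages)."""
    row_total = sum(crosstab[(species, island)] for island in island_levels)
    return {island: crosstab[(species, island)] / row_total for island in island_levels}


print("species", *island_levels)
for species in species_levels:
    cells = [
        f"{crosstab[(species, island)]} ({share:.0%})"
        for island, share in row_shares(species).items()
    ]
    print(species, *cells)
```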

Charts

Bar chart

Grouped bar chart

Error bar chart

Histogram

Scatter plot

Box plot

Box plot (cont.)

Levels of measurement


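  1. Nominal: Categories without order (e.g., species).
  2. Ordinal: Categories with a meaningful order but unknown distances between them (e.g., size: small, medium, large).
  3. Interval: Equal intervals between values but no true zero point (e.g., temperature in °C).
  4. Ratio: Equal intervals and an absolute zero point, so ratios are meaningful (e.g., weight).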

Nominal variables

Gender

  • 🚹 male
  • 🚺 female

🐧 Species

  • Adelie
  • Chinstrap
  • Gentoo

Preference for news sources

  • 📺 TV
  • 📰 Newspapers
  • 🌐 Internet

Ordinal variables

🍽️ Frequency of eating out

Never < Rarely < Sometimes < Often < Always



🎓 Education level

Primary < Secondary < Bachelor's < Master's < PhD



🏆 Ranking

First < Second < Third < … < Last

Metric variables

💰 Income

$854, $1325, $1100, $2475, …



🌑️Temperature

0Β°C, 10Β°C, 20Β°C, 30Β°C, …



⚑Electricity consumption

520 kWh, 164 kWh, 340 kWh, 456 kWh, …