Welcome to R

Descriptive Statistics

Ihor Miroshnychenko

Kyiv School of Economics

Getting started

Software installation

  1. Download R.

  2. Download RStudio.

Also you can use:

Some OS-specific extras

  • Windows: Install Rtools. I also recommend that you install Chocolatey.
  • Mac: Install Homebrew. I also recommend that you configure/open your C++ toolchain (see here.)
  • Linux: None (you should be good to go).

Checklist

version$version.string
[1] "R version 4.4.1 (2024-06-14 ucrt)"


RStudio.Version()$version ## Requires an interactive session, but should return something like [1] ‘2023.12.1.402’.


update.packages(ask = FALSE, checkBuilt = TRUE)

Why R?

Why R and RStudio?

R code example (linear regression)

fit  <- lm(mpg ~ wt, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
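
Individual pieces of the fitted model can also be extracted rather than read off the printed summary. A minimal sketch using base R's coef(), summary() and predict(); the 3,000 lb car is a made-up example, not part of the original slides.

```r
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)                 ## Intercept and slope, as in the Coefficients table
summary(fit)$r.squared    ## Multiple R-squared

## Predicted mpg for a hypothetical car weighing 3,000 lbs (wt is in 1000 lbs)
pred <- predict(fit, newdata = data.frame(wt = 3))
pred
```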

Base R plot

# par(mar = c(4, 4, 1, .1)) ## Just for nice plot margins on this slide deck
plot(mtcars$wt, mtcars$mpg)
abline(fit, col = "red")

ggplot2

library(ggplot2)

ggplot(data = mtcars, aes(x = wt, y = mpg)) + 
  geom_point() +
  geom_smooth(method = "lm", color = "red")

And more…

library(gganimate)
library(gapminder)

ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  facet_wrap(~continent) +
  # Here comes the gganimate specific bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')

And more… (cont.)

library(kableExtra)

mpg_list <- split(mtcars$mpg, mtcars$cyl)
disp_list <- split(mtcars$disp, mtcars$cyl)
inline_plot <- data.frame(cyl = c(4, 6, 8), mpg_box = "", mpg_hist = "",
                          mpg_line1 = "", mpg_line2 = "",
                          mpg_points1 = "", mpg_points2 = "", mpg_poly = "")
inline_plot %>%
  kbl(booktabs = TRUE) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(2, image = spec_boxplot(mpg_list)) %>%
  column_spec(3, image = spec_hist(mpg_list)) %>%
  column_spec(4, image = spec_plot(mpg_list, same_lim = TRUE)) %>%
  column_spec(5, image = spec_plot(mpg_list, same_lim = FALSE)) %>%
  column_spec(6, image = spec_plot(mpg_list, type = "p")) %>%
  column_spec(7, image = spec_plot(mpg_list, disp_list, type = "p")) %>%
  column_spec(8, image = spec_plot(mpg_list, polymin = 5))
[Rendered table: one row per cyl value (4, 6, 8), with inline plots in the columns mpg_box, mpg_hist, mpg_line1, mpg_line2, mpg_points1, mpg_points2 and mpg_poly.]

And more… (cont.)

library(leaflet)

content <- paste(sep = "<br/>",
  "<b><a href='https://kse.ua/ua/'>Kyiv School of Economics</a></b>",
  "Mykoly Shpaka St, 3",
  "Kyiv, Ukraine"
)

leaflet() %>% 
  addTiles() %>% 
  addMarkers(lng = 30.4298435, lat = 50.4584603, popup = content)

Some R basics

Basic arithmetic

1 + 2 ## Addition
[1] 3
6 - 7 ## Subtraction
[1] -1
5 / 2 ## Division
[1] 2.5
2^3 ## Exponentiation, 2 ** 3 will also work
[1] 8
2 + 4 * 1^3 ## Standard order of precedence (`*` before `+`, etc.)
[1] 6
100 %/% 60 ## How many whole hours in 100 minutes?
[1] 1
100 %% 60 ## How many minutes are left over?
[1] 40
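
The last two operators are often used together. A small sketch that converts minutes to hours and minutes; the variable names are only illustrative.

```r
total_minutes <- 100

hours   <- total_minutes %/% 60   ## Integer division: whole hours
minutes <- total_minutes %% 60    ## Modulo: minutes left over

sprintf("%d h %d min", hours, minutes)

## The two results always recombine to the original value
hours * 60 + minutes == total_minutes
```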

Logic

1 > 2
[1] FALSE
1 > 2 & 1 > 0.5 ## The "&" stands for "and"
[1] FALSE
1 > 2 | 1 > 0.5 ## The "|" stands for "or"
[1] TRUE
isTRUE(1 < 2)
[1] TRUE

You can read more about logical operators and types here and here.

Logic (cont.)

Order of precedence

Much like standard arithmetic, logic statements follow a strict order of precedence. Comparison operators (>, ==, etc.) are evaluated before Boolean operators (& and |). Failure to recognise this can lead to unexpected behaviour…

1 > 0.5 & 2
[1] TRUE

What’s happening here is that R is evaluating two separate “logical” statements:

  • 1 > 0.5, which is obviously TRUE.
  • 2, which is TRUE(!) because R is “helpfully” converting it with as.logical(2). Any non-zero number is coerced to TRUE.

Solution: Be explicit about each component of your logic statement(s).

1 > 0.5 & 1 > 2
[1] FALSE
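
To make the coercion rule concrete, here is a short sketch. The values are arbitrary, and the parentheses are optional but make the intended grouping visible.

```r
## Numbers are coerced to logical: only 0 becomes FALSE
coerced <- as.logical(c(-1, 0, 0.5, 2))
coerced

x <- 1
implicit <- x > 0.5 & x < 2       ## Comparisons are evaluated first, then `&`
explicit <- (x > 0.5) & (x < 2)   ## Same result, with the grouping spelled out
identical(implicit, explicit)
```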

Logic (cont.)

Negation: !

We use ! as shorthand for negation. This will come in very handy when we start filtering data objects based on non-missing (i.e. non-NA) observations.

is.na(1:10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!is.na(1:10)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# Negate(is.na)(1:10) ## This also works. Try it yourself.
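
As a preview of that filtering use case, a minimal sketch with a made-up vector containing missing values:

```r
x <- c(4, NA, 7, NA, 10)

n_missing <- sum(is.na(x))   ## Count the missing values
non_missing <- x[!is.na(x)]  ## Keep only the non-missing values
non_missing

mean(x)                      ## NA, because of the missing values
mean(x, na.rm = TRUE)        ## Many functions can drop NAs for you
```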

Logic (cont.)

Value matching: %in%

To see whether an object is contained within (i.e. matches one of) a list of items, use %in%.

4 %in% 1:10
[1] TRUE
4 %in% 5:10
[1] FALSE

There’s no equivalent “not in” command, but how might we go about creating one?

  • Hint: Think about negation…
`%ni%`  <- Negate(`%in%`) ## The backticks (`) help to specify functions.
4 %ni% 5:10
[1] TRUE
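
%in% is vectorised over its left-hand side, so it can test several values at once. The sketch below also writes the “not in” operator as an explicit function, which is an alternative to the Negate() version above and should behave the same way.

```r
c(4, 7, 12) %in% 5:10    ## One result per element on the left

## Alternative definition: negate the result of %in% directly
`%ni%` <- function(x, table) !(x %in% table)
c(4, 7, 12) %ni% 5:10
```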

Logic (cont.)

Evaluation

We’ll get to assignment shortly. However, to preempt it somewhat, we always use two equal signs for logical evaluation.

1 = 1 ## This doesn't work
Error in 1 = 1: invalid (do_set) left-hand side to assignment
1 == 1 ## This does.
[1] TRUE
1 != 2 ## Note the single equal sign when combined with a negation.
[1] TRUE

Logic (cont.)

Evaluation caveat: Floating-point numbers

What do you think will happen if we evaluate 0.1 + 0.2 == 0.3?

0.1 + 0.2 == 0.3
[1] FALSE

Uh-oh! (Or, maybe you’re thinking: Huh??)

Problem: Computers represent numbers as binary (i.e. base 2) floating-points. More here.

  • Fast and memory efficient, but can lead to unexpected behaviour.
  • Similar to the way that standard decimal (i.e. base 10) representation can’t precisely capture certain fractions (e.g. \(\frac{1}{3} = 0.3333...\)).

Solution: Use all.equal() for evaluating floats (i.e. fractions).

all.equal(0.1 + 0.2, 0.3)
[1] TRUE
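
One caveat worth knowing: when the values differ, all.equal() returns a character description of the difference rather than FALSE. Inside if() statements it is therefore usually wrapped in isTRUE(). A short sketch:

```r
x <- 0.1 + 0.2
sprintf("%.17f", x)              ## Print more digits to see the stored value

isTRUE(all.equal(x, 0.3))        ## Safe to use inside if()

res <- all.equal(1, 1.1)         ## Not FALSE, but a message describing the difference
res
isTRUE(res)

abs(x - 0.3) < 1e-9              ## A manual tolerance check also works
```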

Assignment

In R, we can use either <- or = to handle assignment.

Assignment with <-:

<- is normally read aloud as “gets”. You can think of it as a (left-facing) arrow saying assign in this direction.

a <- 10 + 5
a
[1] 15

Of course, an arrow can point in the other direction too (i.e. ->). So, the following code chunk is equivalent to the previous one, although used much less frequently.

10 + 5 -> a

Assignment (cont.)

Assignment with =

You can also use = for assignment.

b = 10 + 10 ## Note that the assigned object *must* be on the left with "=".
b
[1] 20

Which assignment operator to use?

Most R users seem to prefer <- for assignment, since = also has a specific role for evaluation within functions.

Bottom line: Use whichever you prefer. Just be consistent.
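
The difference inside function calls can be seen in a small sketch. The object name y is just for illustration.

```r
## Inside a call, `=` names an argument; it does not create an object called x
avg1 <- mean(x = c(1, 2, 3))

## Inside a call, `<-` creates `y` in your workspace and then passes it on
avg2 <- mean(y <- c(1, 2, 3))

avg1 == avg2
y
```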

Help

For more information on a (named) function or object in R, consult the “help” documentation. For example:

help(plot)

Or, more simply, just use ?:

# This is what most people use.
?plot 

Or, in RStudio, place the cursor on the function name and press F1.


Aside 1: Comments in R are demarcated by #.

  • Hit Ctrl+Shift+c in RStudio to (un)comment whole sections of highlighted code.

Aside 2: See the Examples section at the bottom of the help file?

  • You can run them with the example() function. Try it: example(plot).
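
If you do not remember the exact name of a function, base R has a few search helpers. A brief sketch; the search patterns are arbitrary examples.

```r
## Find objects on the search path whose names match a pattern
hits <- apropos("^read\\.csv")
hits

## Show a function's arguments without opening the help page
args(read.csv)

## Search the help pages by keyword (opens the results in the help viewer):
## help.search("regression")   ## or, equivalently, ??regression
```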

Help (cont.)

Vignettes

For many packages, you can also try the vignette() function, which will provide an introduction to a package and its purpose through a series of helpful examples.

  • Try running vignette("dplyr") in your console now.

I highly encourage reading package vignettes if they are available.

  • They are often the best way to learn how to use a package.

One complication is that you need to know the exact name of the package vignette(s).

  • E.g. The dplyr package actually has several vignettes associated with it: “dplyr”, “window-functions”, “programming”, etc.
  • You can run vignette() (i.e. without any arguments) to list the available vignettes of every package installed on your system.
  • Or, run vignette(all = FALSE) if you only want to see the vignettes of any loaded packages.

Help (cont.)

Similar to vignettes, many packages come with built-in, interactive demos.

To list all available demos on your system:

demo(package = .packages(all.available = TRUE))

To run a specific demo, just tell R which one and the name of the parent package. For example:

demo("graphics", package = "graphics")

Reserved words

We’ve seen that we can assign objects to different names. However, there are a number of special words that are “reserved” in R.

  • These are fundamental commands, operators and relations in base R that you cannot (re)assign, even if you wanted to.
  • We already encountered examples with the logical operators.

See here for a full list, including (but not limited to):

if 
else 
while 
function 
for
TRUE 
FALSE 
NULL 
Inf 
NaN 
NA
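
You can check this for yourself. The sketch below uses a small hypothetical helper, try_assign(), that evaluates a line of code in a throwaway environment and reports whether it failed. Note that the shortcuts T and F are not reserved, which is one reason to spell out TRUE and FALSE.

```r
## Hypothetical helper: run `code` in a scratch environment, report the outcome
try_assign <- function(code) {
  tryCatch({
    eval(parse(text = code), envir = new.env())
    "allowed"
  }, error = function(e) "error")
}

try_assign("TRUE <- 1")   ## Reserved word: cannot be reassigned
try_assign("if <- 1")     ## Reserved word: does not even parse
try_assign("T <- 0")      ## Not reserved: R lets you overwrite the shortcut
```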

Semi-reserved words

In addition to the list of strictly reserved words, there is a class of words and strings that I am going to call “semi-reserved”.

  • These are named functions or constants (e.g. pi) that you can re-assign if you really wanted to… but already come with important meanings from base R.

Arguably the most important semi-reserved character is c(), which we use for concatenation; i.e. creating vectors and binding different objects together.

my_vector = c(1, 2, 5)
my_vector
[1] 1 2 5

What happens if you type the following? (Try it in your console.)

c = 4
c(1, 2, 5)

Vectors are very important in R, because the language has been optimised for them. Don’t worry about this now; later you’ll learn what I mean by “vectorising” a function.

Semi-reserved words (cont.)

(Continued from previous slide.)

In this case, thankfully nothing. R is “smart” enough to distinguish between the variable c = 4 that we created and the built-in function c() that calls for concatenation.

However, this is still extremely sloppy coding. R won’t always be able to distinguish between conflicting definitions. And neither will you. For example:

pi
[1] 3.141593
pi = 2
pi
[1] 2

Bottom line: Don’t use (semi-)reserved characters!
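
If you have already overwritten a built-in name, the original is not lost. A minimal sketch of two ways to get it back:

```r
pi <- 2       ## Masks the built-in constant in the current workspace
pi

base::pi      ## The original is still available from the base package

rm(pi)        ## Remove our copy; the built-in constant is visible again
pi
```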

Namespace conflicts

A similar issue crops up when we load two packages that have functions sharing the same name. E.g. Look what happens when we load the dplyr package.

library(dplyr)

The messages that you see about some object being masked from ‘package:X’ are warning you about a namespace conflict.

  • E.g. Both dplyr and the stats package (which gets loaded automatically when you start R) have functions named “filter” and “lag”.

Namespace conflicts (cont.)

The potential for namespace conflicts is a result of the OOP approach.

  • Also reflects the fundamental open-source nature of R and the use of external packages. People are free to call their functions whatever they want, so some overlap is only to be expected.

Whenever a namespace conflict arises, the most recently loaded package takes precedence. So the filter() function now refers specifically to the dplyr variant.

But what if we want the stats variant? Well, we have two options:

  1. Temporarily use stats::filter()
  2. Permanently assign filter = stats::filter

Solving namespace conflicts

  1. Use package::function()

We can explicitly call a conflicted function from a particular package using the package::function() syntax. For example:

stats::filter(1:10, rep(1, 2))
Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1]  3  5  7  9 11 13 15 17 19 NA

We can also use :: for more than just conflicted cases.

  • E.g. Being explicit about where a function (or dataset) comes from can help add clarity to our code. Try these lines of code in your R console.
dplyr::starwars ## Print the starwars data frame from the dplyr package
scales::comma(c(1000, 1000000)) ## Use the comma function, which comes from the scales package

The :: syntax also means that we can call functions without loading the package first. E.g. As long as dplyr is installed on our system, then dplyr::filter(iris, Species == "virginica") will work.

Solving namespace conflicts (cont.)

  2. Assign function <- package::function

A more permanent solution is to assign a conflicted function name to a particular package. This will hold for the remainder of your current R session, or until you change it back. E.g.

filter  <- stats::filter ## Note the lack of parentheses.
filter  <- dplyr::filter ## Change it back again.

Solving namespace conflicts (cont.)

  3. Use conflict_prefer()
#> ✖️ dplyr::filter()  masks stats::filter()

library(conflicted)
conflict_prefer("filter", winner = "dplyr")

User-side namespace conflicts

A final thing to say about namespace conflicts is that they don’t only arise from loading packages. They can arise when users create their own functions with a conflicting name.


In a similar vein, one of the most common and confusing errors that even experienced R programmers run into is related to the habit of calling objects “df” or “data”… both of which are functions in base R!


See for yourself by typing ?df or ?data.
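
A short sketch of how this plays out in a fresh session where no object called df has been created yet. The exact wording of the error may vary between R versions.

```r
is.function(df)     ## TRUE in a fresh session: df() is the F distribution density
is.function(data)   ## TRUE: data() loads built-in datasets

## Forgetting to create `df` before using it gives a confusing error,
## because R finds the function instead of a data frame
outcome <- tryCatch(df$x, error = function(e) conditionMessage(e))
outcome
```

A descriptive name (e.g. cars_df) avoids the problem altogether.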

Data types and structures

Basic Data Types

  • numeric
    • integer
    • double
  • character
  • logical
  • factor
  • Date

Numeric

  • numeric is the most common data type in R.
  • It can be either an integer or a double (i.e. a floating-point number).
a <- 1 ## integer?
b <- 1.0 ## This is a double
d <- 1.5 ## This is also a double (we avoid the name "c", see semi-reserved words)


class(a)
[1] "numeric"
typeof(a)
[1] "double"


a <- 1L ## This is an integer

typeof(a)
[1] "integer"
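
The type of a result depends on the operation as well as on the inputs. A small sketch with an arbitrary integer:

```r
x <- 5L

typeof(x + 1L)    ## integer: both operands are integers
typeof(x + 1)     ## double: the literal 1 is a double
typeof(x / 2L)    ## double: regular division always returns a double
typeof(x %/% 2L)  ## integer: integer division of two integers

is.numeric(x)     ## TRUE for both integers and doubles
```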

Character

  • character is used for text data.
  • It is defined by wrapping text in either single or double quotes.
a <- "Hello, world!"
b <- 'Hello, world!'

Logical

  • logical is used for binary data.
a <- TRUE
b <- FALSE
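
Logical values behave like 1 (TRUE) and 0 (FALSE) in arithmetic, which makes counting and computing shares easy. A sketch with made-up ages:

```r
ages <- c(51, 24, 2931, 2700)

over_100 <- ages > 100   ## A logical vector, one value per element
over_100

sum(over_100)            ## How many are over 100?
mean(over_100)           ## What share are over 100?
```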

Factor

  • factor is used for categorical data.
  • It can be ordered or unordered.
race <- factor(
  c("istari", "human", "human",
    "elf", "dwarf", "hobbit",
    "hobbit", "hobbit", "hobbit"),
  levels = c("istari", "human", "elf", "dwarf", "hobbit")
  )

race
[1] istari human  human  elf    dwarf  hobbit hobbit hobbit hobbit
Levels: istari human elf dwarf hobbit


lotr_books <- factor(c("The Fellowship of the Ring",
                       "The Return of the King",
                       "The Two Towers"),
                     levels = c("The Fellowship of the Ring",
                                "The Two Towers",
                                "The Return of the King"),
                     ordered = TRUE)

lotr_books
[1] The Fellowship of the Ring The Return of the King    
[3] The Two Towers            
3 Levels: The Fellowship of the Ring < ... < The Return of the King
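
A few base R functions are handy for inspecting factors. The sketch below uses a hypothetical ordered factor of sizes; comparisons such as < only make sense for ordered factors.

```r
sizes <- factor(c("medium", "small", "large", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

levels(sizes)          ## The allowed categories, in order
codes <- as.integer(sizes)   ## The underlying integer codes
codes
table(sizes)           ## Counts per level

below_large <- sizes < "large"   ## Comparison uses the level order
below_large
as.character(max(sizes))
```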

Data structures

Vector

  • A vector is a one-dimensional array whose elements are all of the same type.
c(1, 2, 3, 4, 5)
[1] 1 2 3 4 5
c("a", "b", "c", "d", "e")
[1] "a" "b" "c" "d" "e"
c(TRUE, FALSE, TRUE, TRUE, FALSE)
[1]  TRUE FALSE  TRUE  TRUE FALSE
-5:5
 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
3:-2
[1]  3  2  1  0 -1 -2
seq(1, 10, by = 2)
[1] 1 3 5 7 9
rep(3, 5)
[1] 3 3 3 3 3
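
Most operations in R work on whole vectors at once, and elements are selected with square brackets. A minimal sketch:

```r
x <- seq(1, 10, by = 2)   ## 1 3 5 7 9

doubled <- x * 2          ## Arithmetic is applied element by element
doubled

x[2]                      ## Second element (indexing starts at 1)
x[-1]                     ## Everything except the first element
x[x > 4]                  ## Elements that satisfy a condition
length(x)
```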

Vector (cont.)

  • Vectors can be combined using c().
v1 <- c("Speak", "friend")

v2 <- c("and", "enter")

c(v1, v2)
[1] "Speak"  "friend" "and"    "enter" 

Vector (cont.)

Types of coercion:

  • implicit coercion
  • explicit coercion
c(TRUE, 2, FALSE)
[1] 1 2 0
3 - TRUE
[1] 2
c(TRUE, 2, "Hello")
[1] "TRUE"  "2"     "Hello"

Coercion hierarchy (when types are mixed, the result takes the highest type present):

NULL < raw < logical < integer < double < complex < character < list < expression
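
You can check where a mixed vector ends up with typeof(). A short sketch with arbitrary values:

```r
typeof(c(TRUE, 2L))      ## logical + integer   -> integer
typeof(c(2L, 2.5))       ## integer + double    -> double
typeof(c(2.5, "a"))      ## double + character  -> character
typeof(c(1, list(2)))    ## double + list       -> list
```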

Vector (cont.)

Types of coercion:

  • explicit coercion
as.numeric(c(TRUE, 2, FALSE, FALSE))
[1] 1 2 0 0
as.character(c(TRUE, 2, FALSE, FALSE))
[1] "1" "2" "0" "0"
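
Explicit coercion does not always succeed. When a value cannot be converted, R returns NA (with a warning for numeric conversions). A brief sketch; suppressWarnings() is used here only to keep the output clean.

```r
as.integer(3.9)                      ## Truncates towards zero, does not round

as.logical(c("TRUE", "T", "yes"))    ## "yes" is not recognised, so it becomes NA

nums <- suppressWarnings(as.numeric(c("1.5", "abc")))
nums                                 ## "abc" cannot be parsed, so it becomes NA
```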

Matrix

  • A matrix is a two-dimensional array whose elements are all of the same type.
matrix(1:16, nrow = 4, ncol = 4)
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
matrix(1:16, nrow = 4, ncol = 4, byrow = TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16

Popular functions and operators:

  • rbind() and cbind()
  • dim()
  • rownames() and colnames()
  • t()
  • %*%
  • det()
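
A quick tour of these functions on a small made-up 2 x 2 matrix. Note the difference between element-wise multiplication (*) and matrix multiplication (%*%).

```r
m <- matrix(c(2, 0, 1, 3), nrow = 2)   ## Filled column by column
colnames(m) <- c("a", "b")
m

dim(m)               ## Rows, columns
t(m)                 ## Transpose
m * m                ## Element-wise product
m %*% m              ## Matrix product
det(m)               ## Determinant: 2 * 3 - 1 * 0

dim(rbind(m, c(9, 9)))   ## Add a row    -> 3 x 2
dim(cbind(m, c(9, 9)))   ## Add a column -> 2 x 3
```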

Array

  • An array is a multi-dimensional extension of a matrix.
array(1:16, c(4, 2, 2))
, , 1

     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

, , 2

     [,1] [,2]
[1,]    9   13
[2,]   10   14
[3,]   11   15
[4,]   12   16
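
Array elements are indexed with one position per dimension. A short sketch using the same array:

```r
arr <- array(1:16, c(4, 2, 2))

dim(arr)        ## 4 rows, 2 columns, 2 slices
arr[2, 1, 2]    ## Row 2, column 1 of the second slice
arr[, , 1]      ## The whole first slice (a 4 x 2 matrix)
```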

List

  • A list is a collection of objects (vectors, matrices, arrays, etc.) that can be of different types.
list(vec = c(1:5),
    gandalf = "You shall not pass",
    my_matrix = matrix(1:4, ncol = 2))
$vec
[1] 1 2 3 4 5

$gandalf
[1] "You shall not pass"

$my_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4
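
List elements can be accessed by name with $ or [[ ]], while single brackets return a smaller list. A minimal sketch; the element names are illustrative.

```r
my_list <- list(vec = 1:5,
                quote = "You shall not pass",
                my_matrix = matrix(1:4, ncol = 2))

my_list$vec                ## Extract an element by name
my_list[["quote"]]         ## Same idea, using double brackets
class(my_list["quote"])    ## Single brackets keep the list wrapper
my_list$my_matrix[2, 1]    ## Index into an element after extracting it
```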

Data frame

  • A data frame is a list of vectors of equal length.
data.frame(name = c("Frodo", "Eowyn", "Legolas", "Arwen"),
           sex = c("male", "female", "male", "female"),
           age = c(51, 24, 2931, 2700),
           one_ring = c(TRUE, FALSE, FALSE, FALSE))
     name    sex  age one_ring
1   Frodo   male   51     TRUE
2   Eowyn female   24    FALSE
3 Legolas   male 2931    FALSE
4   Arwen female 2700    FALSE
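
Because a data frame is a list of columns, the list tools carry over, and rows can be filtered with a logical condition. A sketch using a trimmed version of the data above; the object name fellowship is just for illustration.

```r
fellowship <- data.frame(name = c("Frodo", "Eowyn", "Legolas", "Arwen"),
                         age = c(51, 24, 2931, 2700),
                         one_ring = c(TRUE, FALSE, FALSE, FALSE))

is.list(fellowship)      ## A data frame really is a list
dim(fellowship)          ## Rows, columns
str(fellowship)          ## Compact overview of column types

fellowship$age           ## Extract a column as a vector

## Names of everyone older than 100
elders <- fellowship[fellowship$age > 100, "name"]
elders
```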

Additional materials

Questions?



Course materials

imiroshnychenko@kse.org.ua

@araprof

@datamirosh

@ihormiroshnychenko

@aranaur

aranaur.rbind.io