Data manipulation basics

Overview

  1. The tidyverse
    • R Packages
    • Importing data
  2. The dplyr package
    • filter()
    • mutate()
    • ifelse()
    • pipes |>
    • summarize()
    • group_by()
  1. The tidy data format

The tidyverse

Packages

Packages

Packages

  • So far we only used functions that are directly available in R
    • But there are tons of user-created functions out there that can make your life so much easier
    • These functions are shared in what we call packages
  • Packages are bundles of functions that R users put at the disposal of other R users
    • Packages are centralized on the Comprehensive R Archive Network (CRAN)
    • To download and install a CRAN package you can simply type `install.packages()

Using packages

install.packages("name")
  • files to your computer

  • Do this once per computer

library("name")
  • Loads the package

  • Do this once per R session

The tidyverse

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

… the tidyverse makes data science faster, easier and more fun…

The tidyverse

The tidyverse

library(tidyverse)

The tidyverse package is a shortcut for installing and loading all the key tidyverse packages

The tidyverse

install.packages("tidyverse")

Installs all of these:

install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("tibble")
install.packages("stringr")
install.packages("forcats")
install.packages("lubridate")
install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
library(tidyverse)

Loads all of these:

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)
library(lubridate)

Importing data

readr Work with plain text data my_data <- read_csv(“file.csv”)
readxl Work with Excel files my_data <- read_excel(“file.xlsx”)
haven Work with Stata, SPSS, and SAS data my_data <- read_stata(“file.dta”)

Data from R-Packages

Some data sets can be downloaded as packages in R. For example, the gapminder data set.

Install the package

install.packages(gapminder)

Then load the data

library(gapminder)

# The data() function in R is used to list, load, 
# and access built-in or package-provided datasets. 
data(gapminder) 

The dplyr package

tidyverse

dplyr

dplyr: verbs for manipulating data

Extract rows with filter() filter
Extract columns with select() select
Arrange/sort rows with arrange() arrange
Make new columns with mutate() mutate
Make group summaries with group_by() |> summarize() summarize

filter()

Extract rows that meet some sort of test

The general idea:

filter(
  some_data, 
  ... # one or more tests 
  )

Let’s try this on the gapminder data set that you’ve installed earlier.

filter(.data = gapminder, country == "Denmark")
country continent year
Denmark Europe 1952
Denmark Europe 1957
Denmark Europe 1962
Denmark Europe 1967
Denmark Europe 1972

Logical tests

Test Meaning Test Meaning
x < y Less than x %in% y In (group membership)
x > y Greater than is.na(x) Is missing
== Equal to !is.na(x) Is not missing
x <= y Less than or equal to
x >= y Greater than or equal to
x != y Not equal to

Your turn #1: Filtering

Use filter() and logical tests to show…

  1. The data for Canada
  2. All data for countries in Oceania
  3. Rows where the life expectancy is greater than 82
04:00

Your turn #1: Filtering

Use filter() and logical tests to show…

  1. The data for Canada
filter(gapminder, country == "Canada")
  1. All data for countries in Oceania
filter(gapminder, continent == "Oceania")
  1. Rows where the life expectancy is greater than 82
filter(gapminder, lifeExp > 82)

Common Mistakes

Using = instead of ==

Bad

filter(gapminder, country = "Canada")

Good

filter(gapminder, country == "Canada")



Forgetting quotes ("")

Bad

filter(gapminder, country == Canada)

Good

filter(gapminder, country == "Canada")

filter() with multiple conditions

Extract rows that meet every test

filter(gapminder, country == "Denmark", year > 2000)
# A tibble: 2 × 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Denmark Europe     2002    77.2 5374693    32167.
2 Denmark Europe     2007    78.3 5468120    35278.

Boolean operators

Operator Meaning
a & b and
a | b or
!a not

Boolean operators


The default is “and”



These do the same thing:

filter(gapminder, 
       country == "Denmark", 
       year > 2000)
filter(gapminder, 
       country == "Denmark" & 
         year > 2000)

Your turn #2: Filtering

Use filter() and Boolean logical tests to show…

  1. Canada before 1970
  2. Countries where life expectancy in 2007 is below 50
  3. Countries where life expectancy in 2007 is below 50 and are not in Africa
04:00

Your turn #2: Filtering

Use filter() and Boolean logical tests to show…

  1. Canada before 1970
filter(gapminder, country == "Canada", year < 1970)
  1. Countries where life expectancy in 2007 is below 50
filter(gapminder, year == 2007, lifeExp < 50)
  1. Countries where life expectancy in 2007 is below 50 and are not in Africa
filter(gapminder, year == 2007, lifeExp < 50, 
       continent != "Africa")

Common Mistakes

Collapsing multiple tests into one

Bad

filter(gapminder, 
       1960 < year < 1980)

Good

filter(gapminder,
       year > 1960, 
       year < 1980)



Using multiple tests instead of %in%

Bad

filter(gapminder,
       country == "Mexico",
       country == "Canada",
       country == "United States")

Good

filter(gapminder,
       country %in% c("Mexico", "Canada",
                      "United States"))

Common Syntax

Every dplyr verb function follows the same pattern


verb(data, ...)

verb = dplyr function/verb

data = data frame to transfom

... = what you the verb to do exatly

mutate()

Create new columns

The general idea:

mutate(
  some_data, 
  ... # new columns to make
  )

Let’s try this on the gapminder data

mutate(gapminder, gdp = gdpPercap * pop)
country year gdp
Afghanistan 1952 6567086330
Afghanistan 1957 7585448670
Afghanistan 1962 8758855797
Afghanistan 1967 9648014150
Afghanistan 1972 9678553274
Afghanistan 1977 11697659231

mutate()

Create new columns

The general idea:

mutate(
  some_data, 
  ... # new columns to make
  )

We can also create multiple new columns at once

mutate(gapminder, gdp = gdpPercap * pop,
                  pop_mil = round(pop / 1000000))
country year gdp pop_mil
Afghanistan 1952 6567086330 8
Afghanistan 1957 7585448670 9
Afghanistan 1962 8758855797 10
Afghanistan 1967 9648014150 12
Afghanistan 1972 9678553274 13
Afghanistan 1977 11697659231 15

ifelse()

Do conditional tests within mutate()


ifelse(test,
       value_if_true, 
       value_if_false)

test = a logical test

value_if_true = what happens if test is true

value_if_false = what happens if test is false

ifelse()

The new variable can take any sort of class

# a new logical variable
mutate(gapminder, 
       after_1960 = ifelse(year > 1960, TRUE, FALSE)
       )


# a new character variable
mutate(gapminder, 
       after_1960 = ifelse(year > 1960, 
                           "After 1960", 
                           "Before 1960")
       )


# a new numeric variable
mutate(gapminder, 
       after_1960 = ifelse(year > 1960, 0, 1)
       )

Your turn #3: Mutating

Use mutate() to…

  1. Add an africa column that is TRUE if the country is on the African continent
  2. Add a column for logged GDP per capita (hint: use log())
  3. Add an africa_asia column that says “Africa or Asia” if the country is in Africa or Asia, and “Not Africa or Asia” if it’s not
05:00

Your turn #3: Mutating

Use mutate() to…

  1. Add an africa column that is TRUE if the country is on the African continent
mutate(gapminder, africa = ifelse(continent == "Africa", 
                                  TRUE, FALSE))
  1. Add a column for logged GDP per capita (hint: use log())
mutate(gapminder, log_gdpPercap = log(gdpPercap))
  1. Add an africa_asia column that says “Africa or Asia” if the country is in Africa or Asia, and “Not Africa or Asia” if it’s not
mutate(gapminder, 
       africa_asia = 
         ifelse(continent %in% c("Africa", "Asia"), 
                "Africa or Asia", 
                "Not Africa or Asia"))

What if you have multiple verbs?

Solution 1: Intermediate variables

gapminder_2002 <- filter(gapminder, year == 2002)

gapminder_2002_log <- mutate(gapminder_2002,
                             log_gdpPercap = log(gdpPercap))

Solution 2: Nested functions

filter(mutate(gapminder_2002, 
              log_gdpPercap = log(gdpPercap)), 
       year == 2002)

Solution 3: Pipes!

  • The |> operator (pipe) takes an object on the left and passes it as the first argument of the function on the right
gapminder |> 
  filter(year == 2002) |> 
  mutate(log_gdpPercap = log(gdpPercap))

|>

Why using pipes?

leave_house(get_dressed(get_out_of_bed(wake_up(me, time = "8:00"), side = "correct"),
    pants = TRUE, shirt = TRUE), car = TRUE, bike = FALSE)

… 🤯 not easy to read


me |> 
  wake_up(time = "8:00") |> 
  get_out_of_bed(side = "correct") |> 
  get_dressed(pants = TRUE, shirt = TRUE) |> 
  leave_house(car = TRUE, bike = FALSE)

… 🎉 easy to read

|> vs %>%

  • There are actually multiple pipes!

  • %>% was invented first, but requires a package to use

  • |> is part of base R

  • They’re interchangeable 99% of the time (Just be consistent)

You do not have to type the pipe by hand every time

You can use the shortcut cmd + shift + m in R Studio.

summarize()

Compute a table of summaries

  1. Take a data frame
country continent year lifeExp
Afghanistan Asia 1952 28.801
Afghanistan Asia 1957 30.332
Afghanistan Asia 1962 31.997
Afghanistan Asia 1967 34.02
  1. Make a summary
gapminder |>
    summarize(mean_life = mean(lifeExp))
# A tibble: 1 × 1
  mean_life
      <dbl>
1      59.5

Or several summaries

gapminder |> 
  summarize(mean_life = mean(lifeExp),
            min_life = min(lifeExp))
# A tibble: 1 × 2
  mean_life min_life
      <dbl>    <dbl>
1      59.5     23.6

Your turn #4: Summarizing

Use summarize() to calculate…

  1. The first (minimum) year in the dataset
  2. The last (maximum) year in the dataset
  3. The number of rows in the dataset (use the cheatsheet)
  4. The number of distinct countries in the dataset (use the cheatsheet)
05:00

Your turn #4: Summarizing

One Solution for all:

gapminder |> 
  summarize(first = min(year),
            last = max(year),
            num_rows = n(),
            num_unique = n_distinct(country))
# A tibble: 1 × 4
  first  last num_rows num_unique
  <int> <int>    <int>      <int>
1  1952  2007     1704        142

Your turn #5: Summarizing

Use filter() and summarize() to calculate…

  1. the number of unique countries and
  2. the median life expectancy

on the African continent in 2007.

05:00

Your turn #5: Summarizing

Use filter() and summarize() to calculate…

  1. the number of unique countries and
  2. the median life expectancy

on the African continent in 2007.

gapminder |>
  filter(continent == "Africa", year == 2007) |>
  summarise(n_countries = n_distinct(country), 
            med_le = median(lifeExp))
# A tibble: 1 × 2
  n_countries med_le
        <int>  <dbl>
1          52   52.9

group_by()

Put rows into groups based on values in a column

gapminder |> group_by(continent)
  • Nothing happens by itself!

  • Powerful when combined with summarize()

group_by()

country continent year lifeExp
Afghanistan Asia 1952 28.801
Afghanistan Asia 1957 30.332
Afghanistan Asia 1962 31.997
Afghanistan Asia 1967 34.02

A simple summary

gapminder |>
    summarize(n_countries = n_distinct(country))
# A tibble: 1 × 1
  n_countries
        <int>
1         142

A grouped summary

gapminder |> 
  group_by(continent) |> 
  summarize(n_countries = n_distinct(country)) 
# A tibble: 5 × 2
  continent n_countries
  <fct>           <int>
1 Africa             52
2 Americas           25
3 Asia               33
4 Europe             30
5 Oceania             2

Your turn #6: Grouping and summarizing

  1. Find the minimum, maximum, and median life expectancy for each continent


  1. Find the minimum, maximum, and median life expectancy for each continent in 2007 only
05:00

Your turn #6: Grouping and summarizing

  1. Find the minimum, maximum, and median life expectancy for each continent
gapminder |> 
  group_by(continent) |> 
  summarize(min_le = min(lifeExp),
            max_le = max(lifeExp),
            med_le = median(lifeExp))
  1. Find the minimum, maximum, and median life expectancy for each continent in 2007 only
gapminder |> 
  filter(year == 2007) |> 
  group_by(continent) |> 
  summarize(min_le = min(lifeExp),
            max_le = max(lifeExp),
            med_le = median(lifeExp))

dplyr: verbs for manipulating data

Extract rows with filter() filter
Extract columns with select() select
Arrange/sort rows with arrange() arrange
Make new columns with mutate() mutate
Make group summaries with group_by() |> summarize() summarize

Tidy data

Tidy data

You can represent the same underlying data in multiple ways.

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Tidy data

Tidy data has the following properties:

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

  1. Variables are columns
  2. Observations are rows
  3. Values are cells

Why ensure that your data is tidy?

There are two main advantages:

  1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
  1. There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in ?@sec-mutate and ?@sec-summarize, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

Will I ever encounter a dataset that isn’t tidy?

Yes, unfortunately, most real data is untidy.

There are two main reasons:

  1. Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.

  2. Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

Pivoting data

tidyr provides two main functions to “pivot” data in a tidy format:

  • pivot_longer()

and

  • pivot_wider()


Here, we’ll only discuss pivot_longer() because it’s the most common case.

pivot_longer()

  • Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient.
  • We’ll create the data with tribble(), a handy function for constructing small tibbles by hand:
df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

df
# A tibble: 3 × 3
  id      bp1   bp2
  <chr> <dbl> <dbl>
1 A       100   120
2 B       140   115
3 C       120   125

pivot_longer()

  • We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values)
  • To achieve this, we need to pivot df longer

pivot_longer()

  • The values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted.

pivot_longer()

  • The column names become values of the new variable measurement

pivot_longer()

  • The cell values become values of the new variable value

pivot_longer()

# A tibble: 3 × 3
  id      bp1   bp2
  <chr> <dbl> <dbl>
1 A       100   120
2 B       140   115
3 C       120   125
df |> 
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

There are three key arguments:

  • cols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select()
  • names_to names the variable in which column names should be stored
  • values_to names the variable in which cell values should be stored

Your turn #7: Pivoting

The billboard dataset which comes with the tidyverse package records the billboard rank of songs in the year 2000.

head(billboard)
# A tibble: 6 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
6 98^0        Give… 2000-08-19      51    39    34    26    26    19     2     2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
  • In this dataset, each observation is a song.
  • The first three columns (artist, track and date.entered) are variables that describe the song.
  • Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week.

Your turn #7: Pivoting

  1. Use pivot_longer() to tidy the data (Tip: Create the new variables week and rank). Assign the resulting data frame to a new data frame called tidy_billboard.

  2. Use the new tidy_billboard data frame to calculate which song has been the longest on rank 1 (Tip: use filter(), group_by() and summarize())

05:00

Your turn #7: Pivoting

  1. Use pivot_longer() to tidy the data (Tip: Create the new variables week and rank). Assign the resulting data frame to a new data frame called tidy_billboard.
tidy_billboard <- billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  ) 
  1. Use the new tidy_billboard data frame to calculate which song has been the longest on rank 1 (Tip: use filter(), group_by() and summarize())
tidy_billboard |> 
  filter(rank == 1) |> 
  group_by(track) |> 
  summarize(weeks_on_rank_1 = n()) |> 
  arrange(desc(weeks_on_rank_1)) 

# alternative solution
tidy_billboard |> 
  group_by(track) |> 
  summarize(weeks_on_rank_1 = sum(rank == 1, na.rm = TRUE)) |> 
  arrange(desc(weeks_on_rank_1)) 

That’s it for today :)