tidyverse
dplyr package
filter()
mutate()
ifelse()
|>
summarize()
group_by()
tidy data format
tidyverse
“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
… the tidyverse makes data science faster, easier and more fun…
The tidyverse package is a shortcut for installing and loading all the key tidyverse packages.
install.packages("tidyverse") installs all of these:
install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("readr")
install.packages("purrr")
install.packages("tibble")
install.packages("stringr")
install.packages("forcats")
install.packages("lubridate")
install.packages("hms")
install.packages("DBI")
install.packages("haven")
install.packages("httr")
install.packages("jsonlite")
install.packages("readxl")
install.packages("rvest")
install.packages("xml2")
install.packages("modelr")
install.packages("broom")
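In practice, a single install plus a single library() call stands in for the long list above. A minimal sketch, assuming the package has already been installed once (the install line is commented out so the script can be re-run safely):

```r
# Install once per machine; uncomment on first use.
# install.packages("tidyverse")

# Load the core tidyverse packages (dplyr, ggplot2, tidyr, readr, ...)
# at the start of every new R session.
library(tidyverse)
```

Packages that are installed by the tidyverse but not part of its core set, such as readxl and haven, still need their own library() call.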
Package | Purpose | Example |
---|---|---|
readr | Work with plain text data | my_data <- read_csv("file.csv") |
readxl | Work with Excel files | my_data <- read_excel("file.xlsx") |
haven | Work with Stata, SPSS, and SAS data | my_data <- read_stata("file.dta") |
Data from R-Packages
Some data sets can be downloaded as packages in R. For example, the gapminder data set.
Install the package
Then load the data
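The two steps can be sketched as follows, assuming the gapminder package from CRAN (the install line is commented out because it only needs to run once):

```r
# Install the package (once per machine).
# install.packages("gapminder")

# Load the package; this makes the `gapminder` data frame available.
library(gapminder)

# Look at the first few rows.
head(gapminder)
```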
dplyr package
dplyr: verbs for manipulating data
Extract rows with filter()
Extract columns with select()
Arrange/sort rows with arrange()
Make new columns with mutate()
Make group summaries with group_by() |> summarize()
filter()
Extract rows that meet some sort of test
Test | Meaning | Test | Meaning |
---|---|---|---|
x < y | Less than | x %in% y | In (group membership) |
x > y | Greater than | is.na(x) | Is missing |
x == y | Equal to | !is.na(x) | Is not missing |
x <= y | Less than or equal to | | |
x >= y | Greater than or equal to | | |
x != y | Not equal to | | |
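A few of these tests in action, as a minimal sketch on the gapminder data. The object names (denmark, recent, has_gdp) are made up for illustration:

```r
library(dplyr)
library(gapminder)

# Keep only the rows where country equals "Denmark".
denmark <- filter(gapminder, country == "Denmark")

# Keep only the rows with a year after 2000.
recent <- filter(gapminder, year > 2000)

# Keep only the rows where GDP per capita is not missing.
has_gdp <- filter(gapminder, !is.na(gdpPercap))
```

The first argument is always the data frame; the test is evaluated row by row, and only rows where it is TRUE are kept.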
Use filter() and logical tests to show…
Common mistakes:
Using = instead of ==
Forgetting quotes ("")
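Both mistakes are easy to reproduce. In this sketch the failing calls are left as comments and only the corrected version runs (the object name canada is illustrative):

```r
library(dplyr)
library(gapminder)

# Mistake 1: a single = names an argument instead of testing equality,
# so filter() stops with an error.
# filter(gapminder, country = "Canada")

# Mistake 2: without quotes, R looks for an object called Canada
# and fails because it does not exist.
# filter(gapminder, country == Canada)

# Correct: == for the test, quotes around the text value.
canada <- filter(gapminder, country == "Canada")
```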
filter() with multiple conditions
Extract rows that meet every test
# A tibble: 2 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Denmark Europe 2002 77.2 5374693 32167.
2 Denmark Europe 2007 78.3 5468120 35278.
Operator | Meaning |
---|---|
a & b | and |
a \| b | or |
!a | not |
The default is “and”
These do the same thing:
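A sketch of two equivalent calls, using the Denmark example from the output shown earlier. Separating tests with a comma and joining them with & give the same rows:

```r
library(dplyr)
library(gapminder)

# Tests separated by commas are combined with "and".
with_comma <- filter(gapminder, country == "Denmark", year > 2000)

# The same filter written with an explicit &.
with_and <- filter(gapminder, country == "Denmark" & year > 2000)

# Both versions return the same two rows (2002 and 2007).
identical(with_comma, with_and)
```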
Use filter() and Boolean logical tests to show…
Common mistakes:
Collapsing multiple tests into one
Using multiple tests instead of %in%
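Both patterns can be illustrated on gapminder. This is a sketch; the country list and object names are chosen only for the example:

```r
library(dplyr)
library(gapminder)

# Collapsed test: `year == 2002 | 2007` is read as (year == 2002) | 2007,
# and a non-zero number counts as TRUE, so every row passes.
collapsed <- filter(gapminder, year == 2002 | 2007)

# Each side of | needs to be a complete test.
either_year <- filter(gapminder, year == 2002 | year == 2007)

# Many tests on the same column work, but are long-winded...
long_way <- filter(gapminder,
                   country == "Mexico" | country == "Canada" |
                     country == "United States")

# ...and %in% says the same thing more compactly.
short_way <- filter(gapminder,
                    country %in% c("Mexico", "Canada", "United States"))
```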
Every dplyr verb function follows the same pattern
mutate()
Create new columns
We can also create multiple new columns at once
country | year | … | gdp | pop_mil |
---|---|---|---|---|
Afghanistan | 1952 | … | 6567086330 | 8 |
Afghanistan | 1957 | … | 7585448670 | 9 |
Afghanistan | 1962 | … | 8758855797 | 10 |
Afghanistan | 1967 | … | 9648014150 | 12 |
Afghanistan | 1972 | … | 9678553274 | 13 |
Afghanistan | 1977 | … | 11697659231 | 15 |
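The gdp and pop_mil columns in the table above are consistent with a call like the following. This is a reconstruction, assuming gdp is GDP per capita times population and pop_mil is population rounded to whole millions:

```r
library(dplyr)
library(gapminder)

# Create two new columns in a single mutate() call.
gapminder_new <- mutate(gapminder,
                        gdp = gdpPercap * pop,
                        pop_mil = round(pop / 1000000))
```

The existing columns are kept; new columns are appended on the right.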
ifelse()
Do conditional tests within mutate()
ifelse()
The new variable can take any sort of class
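ifelse() takes a test, a value to use where the test is TRUE, and a value to use where it is FALSE. A small sketch, where the column names after_1960 and era are made up for illustration:

```r
library(dplyr)
library(gapminder)

gapminder_flags <- mutate(gapminder,
  # Logical result: TRUE or FALSE.
  after_1960 = ifelse(year > 1960, TRUE, FALSE),
  # Character result: a text label instead.
  era = ifelse(year > 1960, "After 1960", "Before 1960"))
```

The two new columns use the same test but have different classes (logical and character), depending only on the values supplied.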
Use mutate() to…
Add an africa column that is TRUE if the country is on the African continent
… (log())
Add an africa_asia column that says “Africa or Asia” if the country is in Africa or Asia, and “Not Africa or Asia” if it’s not
Solution 1: Intermediate variables
Solution 2: Nested functions
|>
Why use pipes?
… 🤯 not easy to read
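To compare the approaches, here is a sketch that chains two verbs. The task itself (keep only 2002, then add logged GDP per capita) is an assumed example, and the object names are illustrative:

```r
library(dplyr)
library(gapminder)

# Solution 1: intermediate variables.
gapminder_2002 <- filter(gapminder, year == 2002)
gapminder_2002_log <- mutate(gapminder_2002, log_gdpPercap = log(gdpPercap))

# Solution 2: nested functions, which have to be read from the inside out.
nested <- mutate(filter(gapminder, year == 2002),
                 log_gdpPercap = log(gdpPercap))

# With the pipe: read top to bottom as "take gapminder, then filter,
# then mutate".
piped <- gapminder |>
  filter(year == 2002) |>
  mutate(log_gdpPercap = log(gdpPercap))
```

All three produce the same data frame; the pipe passes the result on its left as the first argument of the function on its right.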
|> vs %>%
There are actually multiple pipes!
%>% was invented first, but requires a package to use (magrittr, which is loaded with the tidyverse)
|> is part of base R (since R 4.1.0)
They’re interchangeable 99% of the time (just be consistent)
You do not have to type the pipe by hand every time
You can use the shortcut cmd + shift + m in RStudio.
summarize()
Compute a table of summaries
country | continent | year | lifeExp |
---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 |
Afghanistan | Asia | 1957 | 30.332 |
Afghanistan | Asia | 1962 | 31.997 |
Afghanistan | Asia | 1967 | 34.02 |
… | … | … | … |
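A minimal sketch of summarize() on this data. The summary names mean_life and n_rows are chosen for illustration:

```r
library(dplyr)
library(gapminder)

# Collapse the whole data frame into a single row of summaries.
life_summary <- gapminder |>
  summarize(mean_life = mean(lifeExp),
            n_rows = n())
```

The result has one column per summary and, without grouping, exactly one row.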
Use summarize() to calculate…
One solution for all:
Use filter() and summarize() to calculate…
on the African continent in 2007.
group_by()
Put rows into groups based on values in a column
Nothing happens by itself!
Powerful when combined with summarize()
A simple summary
# A tibble: 1 × 1
n_countries
<int>
1 142
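The n_countries output above is consistent with counting distinct countries. The sketch below shows that call, followed by the same summary after group_by(); the grouped version is an added illustration, not taken from the slides:

```r
library(dplyr)
library(gapminder)

# A simple summary: one row for the whole data set.
n_total <- gapminder |>
  summarize(n_countries = n_distinct(country))

# The same summary after grouping: one row per continent.
by_continent <- gapminder |>
  group_by(continent) |>
  summarize(n_countries = n_distinct(country))
```

group_by() alone only marks the groups; the effect shows once summarize() computes one row per group.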
You can represent the same underlying data in multiple ways.
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
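Tidy data has the following properties:
Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.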
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
There are two main advantages:
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
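As a small illustration of that advantage: the tidy version of the data shown above ships with tidyr as table1, and a new variable takes a single mutate() call. The rate per 10,000 people is an assumed example calculation:

```r
library(dplyr)
library(tidyr)

# table1 is the tidy country/year/cases/population data bundled with tidyr.
table1_rate <- table1 |>
  mutate(rate = cases / population * 10000)
```

The same calculation on the long or the "cases/population" text layout would first require reshaping or splitting the data.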
Yes, unfortunately, most real data is untidy.
There are two main reasons:
Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.
Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.
tidyr provides two main functions to “pivot” data into a tidy format: pivot_longer() and pivot_wider()
Here, we’ll only discuss pivot_longer() because it’s the most common case.
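pivot_longer()
Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient.
We create the data with tribble(), a handy function for constructing small tibbles by hand:
# A tibble: 3 × 3
  id      bp1   bp2
  <chr> <dbl> <dbl>
1 A       100   120
2 B       140   115
3 C       120   125
The new dataset should have three variables: id (already exists), measurement (the column names), and value (the cell values).
To achieve this, we need to pivot df longer with pivot_longer().
The values in a column that was already a variable (id) need to be repeated, once for each column that is pivoted.
The column names become the values of the new variable measurement.
The cell values become the values of the new variable value.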
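The blood pressure example can be sketched as follows. The data values are taken from the tibble printed above; the name df_long is made up for illustration:

```r
library(tidyr)
library(tibble)

# Build the small data frame by hand, one row per patient.
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A",  100,  120,
  "B",  140,  115,
  "C",  120,  125
)

# Pivot the two measurement columns into name/value pairs.
df_long <- df |>
  pivot_longer(
    cols = bp1:bp2,             # the columns that are not variables
    names_to = "measurement",   # old column names go here
    values_to = "value"         # old cell values go here
  )
```

The result has one row per patient and measurement, so each id appears twice.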
There are three key arguments:
cols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select()
names_to names the variable in which column names should be stored
values_to names the variable in which cell values should be stored
The billboard dataset, which ships with tidyr and is therefore available once the tidyverse package is loaded, records the Billboard rank of songs in the year 2000.
# A tibble: 6 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
The first three columns (artist, track and date.entered) are variables that describe the song.
Then there are 76 columns (wk1-wk76) that describe the rank of the song in each week.
Use pivot_longer() to tidy the data (Tip: Create the new variables week and rank). Assign the resulting data frame to a new data frame called tidy_billboard.
Use the new tidy_billboard data frame to calculate which song spent the most weeks at rank 1 (Tip: use filter(), group_by() and summarize())
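One possible approach to both tasks, offered as a sketch rather than the only solution. The names weeks_at_top and weeks_at_1 are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Task 1: pivot the 76 week columns into a week/rank pair.
tidy_billboard <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# Task 2: count, per song, the weeks spent at rank 1
# and sort with the longest run first.
weeks_at_top <- tidy_billboard |>
  filter(rank == 1) |>
  group_by(artist, track) |>
  summarize(weeks_at_1 = n(), .groups = "drop") |>
  arrange(desc(weeks_at_1))
```

Weeks in which a song was not in the chart show up as NA in rank; filter() drops them here because a test that evaluates to NA does not count as TRUE.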