library(tidyverse)
library(janitor)
Problem set 2
Getting started
Read the data
breed_rank <- read_csv("data/breed_rank.csv")
breed_traits <- read_csv("data/breed_traits.csv")
Rows: 195 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Breed
dbl (8): 2013 Rank, 2014 Rank, 2015 Rank, 2016 Rank, 2017 Rank, 2018 Rank, 2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 195 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Breed, Coat Type, Coat Length
dbl (14): Affectionate With Family, Good With Young Children, Good With Othe...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Clean the data
Display variables.
names(breed_rank)
[1] "Breed" "2013 Rank" "2014 Rank" "2015 Rank" "2016 Rank" "2017 Rank"
[7] "2018 Rank" "2019 Rank" "2020 Rank"
names(breed_traits)
[1] "Breed" "Affectionate With Family"
[3] "Good With Young Children" "Good With Other Dogs"
[5] "Shedding Level" "Coat Grooming Frequency"
[7] "Drooling Level" "Coat Type"
[9] "Coat Length" "Openness To Strangers"
[11] "Playfulness Level" "Watchdog/Protective Nature"
[13] "Adaptability Level" "Trainability Level"
[15] "Energy Level" "Barking Level"
[17] "Mental Stimulation Needs"
Make better names.
breed_traits <- breed_traits |>
  clean_names()
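To see what clean_names() does to column names like these, here is a minimal sketch on a toy tibble. The object names toy and toy_clean and all values are invented for illustration; only the column names are borrowed from breed_traits.

```r
library(tibble)
library(janitor)

# Hypothetical toy data: column names borrowed from breed_traits, values invented
toy <- tibble(
  `Breed` = c("A", "B"),
  `Shedding Level` = c(3, 1),
  `Drooling Level` = c(2, 4)
)

# clean_names() converts the names to lowercase snake_case
toy_clean <- clean_names(toy)
names(toy_clean)
```

The cleaned names no longer contain spaces or capitals, so they can be typed in dplyr verbs without backticks.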
Manipulate the data using dplyr
Make a summary.
breed_traits |>
  group_by(shedding_level) |>
  summarise(n = n())
# A tibble: 6 × 2
  shedding_level     n
           <dbl> <int>
1              0     1
2              1    27
3              2    41
4              3   109
5              4    16
6              5     1
Filter out the breed with a shedding_level of 0.
breed_traits <- breed_traits |>
  filter(shedding_level != 0)
Check if manipulation was successful.
breed_traits |> count(shedding_level)
# A tibble: 5 × 2
  shedding_level     n
           <dbl> <int>
1              1    27
2              2    41
3              3   109
4              4    16
5              5     1
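The summary and the check both count rows per shedding level: count() is shorthand for group_by() followed by summarise(n = n()). A minimal sketch of that equivalence, using a made-up toy tibble (toy, by_hand and shortcut are hypothetical names, and the values are invented):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for illustration
toy <- tibble(shedding_level = c(1, 1, 2, 3, 3, 3))

# Long form: group, then count rows per group
by_hand <- toy |>
  group_by(shedding_level) |>
  summarise(n = n())

# Short form: count() does both steps
shortcut <- toy |>
  count(shedding_level)

all.equal(by_hand, shortcut)
```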
Make a data frame of untidiness scores: the sum of shedding level, coat grooming frequency, and drooling level.
untidy_scores <- breed_traits |>
  mutate(untidy_score = shedding_level +
           coat_grooming_frequency + drooling_level) |>
  select(breed, untidy_score)
Arrange scores in descending order.
untidy_scores |>
  arrange(desc(untidy_score))
# A tibble: 194 × 2
   breed                  untidy_score
   <chr>                         <dbl>
 1 Bernese Mountain Dogs            11
 2 Leonbergers                      11
 3 Newfoundlands                    10
 4 Bloodhounds                      10
 5 St. Bernards                     10
 6 Old English Sheepdogs            10
 7 Dogues de Bordeaux               10
 8 Neapolitan Mastiffs              10
 9 Black Russian Terriers           10
10 Tibetan Mastiffs                 10
# ℹ 184 more rows
Tidying the data
How does the breed_rank data set fail to meet the criteria for tidy data?
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
The data contain two variables, year and rank, but neither has its own column: the years are spread across the column names and the ranks fill the cells beneath them. Each row holds one dog breed, but that “one” observation is actually eight separate observations: the rank in 2013, the rank in 2014, and so on through 2020. Each observation needs to have its own row.
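A toy illustration of the difference, assuming a made-up two-breed, two-year excerpt (wide, long, the breed labels and the ranks are all invented): the wide layout has one row per breed, while the tidy layout has one row per breed and year.

```r
library(tidyr)
library(tibble)

# Hypothetical excerpt in the same layout as breed_rank; ranks are invented
wide <- tibble(
  Breed = c("Breed A", "Breed B"),
  `2013 Rank` = c(1, 2),
  `2014 Rank` = c(2, 1)
)

# One row per breed-year combination
long <- wide |>
  pivot_longer(-Breed, names_to = "year", values_to = "rank")

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```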
Make pivoted data with a year and a rank variable.
ranks_pivoted <- breed_rank |>
  pivot_longer(`2013 Rank`:`2020 Rank`,
               names_to = "year",
               values_to = "rank")
Rename Breed to breed and make the year variable numeric.
ranks_pivoted <- ranks_pivoted |>
  rename(breed = Breed) |>
  mutate(year = parse_number(year))
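A quick sketch of why parse_number() works here: it pulls the first number out of each string and discards the surrounding text. The vector year_labels below is a hypothetical stand-in for the values the year column holds after pivoting.

```r
library(readr)

# Column names as they appear in breed_rank
year_labels <- c("2013 Rank", "2014 Rank", "2020 Rank")

# Extract the numeric part of each label
years <- parse_number(year_labels)
years
# 2013 2014 2020
```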
Filter data to only Bernese Mountain Dogs.
ranks_pivoted <- ranks_pivoted |>
  filter(str_detect(breed, "Bernese"))
Plot rankings across time.
ranks_pivoted |>
  ggplot(aes(x = year, y = rank, label = rank)) +
  geom_point(size = 3) +
  geom_text(vjust = 2)