Emphasizing Publishers Does Not Effectively Reduce Susceptibility to Misinformation on Social Media.

highlight source
Author

Dias, Nicholas

Published

2020

Reference

Dias, Nicholas, Gordon Pennycook, and David G. Rand. 2020. “Emphasizing Publishers Does Not Effectively Reduce Susceptibility to Misinformation on Social Media.” Harvard Kennedy School Misinformation Review, January. https://doi.org/10.37016/mr-2020-001.

Intervention

Code
intervention_info <- tibble(
    intervention_description = 'Study 1: In the control condition, participants saw Facebook-like news posts, with the source domain shown in gray text. Two treatment conditions: in one, the logo of the publisher was shown in a bright banner (logo banner); in the other, no source was shown (neither gray text nor logo banner). Study 2: Like Study 1, but without the "no source" condition.',
    control_format = "facebook",
    control_selection = "facebook",
    control_selection_description = "For Study 1, we will use the Facebook-like condition (facebook) and NOT the condition without a source (no_source) as a control group, since it matches the control group of Study 2, is more comparable with other studies\' control groups, and is closer to real-world settings.",
    originally_identified_treatment_effect = FALSE)

# display
show_conditions(intervention_info)
intervention_description control_selection_description
Study 1: In the control condition, participants saw Facebook-like news posts, with the source domain shown in gray text. Two treatment conditions: in one, the logo of the publisher was shown in a bright banner (logo banner); in the other, no source was shown (neither gray text nor logo banner). Study 2: Like Study 1, but without the "no source" condition. For Study 1, we will use the Facebook-like condition (facebook) and NOT the condition without a source (no_source) as a control group, since it matches the control group of Study 2, is more comparable with other studies' control groups, and is closer to real-world settings.

Notes

Studies 3, 4, and 5 are not relevant, as participants did not provide accuracy ratings but instead rated the trustworthiness of different sources. Study 6 would in principle be relevant, as it also tests an intervention that showcases the source. However, unlike in Studies 1 and 2, the baseline control condition of Study 6 presented the text of the news headline in isolation, i.e. as plain text without a source. The treatment effect there is thus not highlighting a source, but adding a source in the first place. Since many other studies use a Facebook format in which the source is present, and since this format is more realistic in real-world contexts, we prefer it as a baseline. We therefore exclude Study 6.

One oddity: In Study 2, there are slightly fewer distinct WorkerIDs than ResponseIDs, suggesting that some individuals might have taken the survey several times (see below). Study 1 has no ResponseID variable, but its original wide-format data also contains one more completed survey (i.e. row) than distinct WorkerIDs (see below).

To err on the side of caution, we exclude all WorkerIDs with multiple survey takes.
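
A minimal sketch of this rule (the per-study code below handles the specifics; this assumes wide-format data with one row per completed survey):

Code
# keep only workers who appear exactly once in the wide data,
# i.e. drop every row belonging to a worker with multiple survey takes
drop_repeat_workers <- function(data_wide) {
  data_wide |> 
    group_by(WorkerID) |> 
    filter(n() == 1) |> 
    ungroup()
}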

Data Cleaning

Study 1

Code
d <- read_csv("dias_2020-study_1.csv") 
Rows: 563 Columns: 602
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (14): Party_TEXT, IPAddress, StartDate, EndDate, WorkerID, Comments, Le...
dbl (588): Condition, Fake, Real, Fake_C, Real_C, Fake_L, Real_L, Fake_PC, F...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(d)
# A tibble: 6 × 602
  Condition  Fake  Real Fake_C Real_C Fake_L Real_L Fake_PC Fake_nPC Real_PC
      <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
1         1  1.25  1.17   1      1      1.5    1.33    1.5      1       1.33
2         1  1.67  2.17   1.5    2      1.83   2.33    1.83     1.5     2.33
3         1  1.75  3.08   1.33   2.83   2.17   3.33    2.17     1.33    3.33
4         1  1.25  2.42   1      3      1.5    1.83    1.5      1       1.83
5         1  1.17  3.17   1      2.83   1.33   3.5     1.33     1       3.5 
6         1  1.58  2      1.33   2      1.83   2       1.83     1.33    2   
# ℹ 592 more variables: Real_nPC <dbl>, Discernment <dbl>, C_Discernment <dbl>,
#   L_Discernment <dbl>, PC_Discernment <dbl>, nPC_Discernment <dbl>,
#   CRT_split <dbl>, ClintonTrump <dbl>, Media_Leaders <dbl>, Media_Bias <dbl>,
#   Fake_SM <dbl>, Real_SM <dbl>, Fake_SM_C <dbl>, Real_SM_C <dbl>,
#   Fake_SM_L <dbl>, Real_SM_L <dbl>, Fake_SM_PC <dbl>, Fake_SM_nPC <dbl>,
#   Real_SM_PC <dbl>, Real_SM_nPC <dbl>, SocialMedia_Chk <dbl>, CRT_ACC <dbl>,
#   CRT_Rand <dbl>, CRT_Thomson <dbl>, Age <dbl>, Sex <dbl>, Education <dbl>, …

accuracy_raw, veracity

There is no documentation, but from the Stata code the authors provide, we know that:

  • Columns ending in _2 (like Fake1_2 or Real1_2) contain participants’ accuracy ratings for each news item; columns ending in _3 are treated as sharing intentions in the reshaping code below. The layout is messy: apparently, each condition has its own set of outcome variables.

  • Real and Fake in the column names refer to whether the news item was true (real) or false (fake).
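
To illustrate the naming convention, we can peek at the rating columns before reshaping (a quick sketch; only the column-name pattern matters here):

Code
# list the accuracy (_2) and sharing (_3) rating columns
d |> 
  select(matches("^(Real|Fake)\\d+_(2(\\.\\d)?|3(\\.\\d)?)$")) |> 
  names() |> 
  head(10)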

We bring the data into long format and build an accuracy outcome column.

Code
d_long <- d |>
  pivot_longer(
    cols = matches("^(Real|Fake)\\d+_(2(\\.\\d)?|3(\\.\\d)?)$"),  # match _2, _2.0, _2.1, _3, _3.0, _3.1
    names_to = c("veracity", "item", "measure"),
    names_pattern = "^(Real|Fake)(\\d+)_([23](?:\\.\\d)?)$", 
    values_to = "value"
  ) |>
  mutate(
    measure = case_when(
      str_starts(measure, "2") ~ "accuracy",
      str_starts(measure, "3") ~ "sharing"
    )
  ) |>
  # remove NAs, which inevitably exist since each condition has its own set of outcome columns
  drop_na(value) |> 
  pivot_wider(names_from = measure, values_from = value) |> 
  rename(accuracy_raw = accuracy)

# check
d_long |>
  group_by(Condition) |>
  summarise(n_participants = n_distinct(WorkerID),
            n_valid_response = sum(!is.na(accuracy_raw)))
# A tibble: 3 × 3
  Condition n_participants n_valid_response
      <dbl>          <int>            <int>
1         1            190             4555
2         2            181             4340
3         3            191             4605

We code the veracity variable.

Code
d_long <- d_long |> 
  mutate(
    veracity = if_else(veracity == "Fake", "false", "true")
    ) 

scale

Code
table(d_long$accuracy_raw, useNA = "always")

   1    2    3    4 <NA> 
4930 3388 3822 1360   11 
Code
d_long <- d_long |>
  mutate(scale = 4)

Conditions (intervention_label, control_label, condition)

From the Stata code we can conclude that in Study 1 the conditions are no_source (coded as 1), facebook (coded as 2), and highlight_banner (coded as 3). In Study 2, the conditions are facebook (coded as 1) and highlight_banner (coded as 2).

Code
# check
d_long |> 
  group_by(Condition) |> 
  summarise(mean(accuracy_raw, na.rm=TRUE))
# A tibble: 3 × 2
  Condition `mean(accuracy_raw, na.rm = TRUE)`
      <dbl>                              <dbl>
1         1                               2.12
2         2                               2.08
3         3                               2.16

We code the condition variable.

Code
d_long <- d_long |>
  mutate(
    intervention_label = case_when(
      Condition == 3 ~ "highlight_banner",
      TRUE ~ NA_character_
    ),
    control_label = case_when(
      Condition == 1 ~ "no_source",
      Condition == 2 ~ "facebook",
      TRUE ~ NA_character_
    ),
    condition = if_else(Condition == 3, "treatment", "control")
  )
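
As a sanity check (a minimal sketch), we can confirm that the original Condition codes map onto the new labels as intended:

Code
# cross-tabulate the original codes with the new condition labels
d_long |> 
  count(Condition, condition, control_label, intervention_label)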

news_id

We have previously coded item, but this is not yet a unique item identifier: these numbers only identify items within each veracity category.

Code
d_long |> 
  group_by(veracity) |> 
  summarise(n_distinct(item))
# A tibble: 2 × 2
  veracity `n_distinct(item)`
  <chr>                 <int>
1 false                    12
2 true                     12

For our news identifier, we therefore combine the veracity variable with these identifiers.

Code
d_long <- d_long |> 
  mutate(news_id = paste0(veracity, "_", item))
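
As a check (a minimal sketch), combining veracity and item should yield 24 distinct news identifiers (12 false and 12 true items):

Code
# each headline should now have a unique identifier across veracity categories
n_distinct(d_long$news_id)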

age

Code
d_long <- d_long |> 
  mutate(age = Age
         )

year

Code
d_long <- d_long |> 
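  # StartDate appears to be stored as month/day/year text, so we parse it with lubridate::mdy() before extracting the year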
  mutate(year = year(mdy(StartDate))
         )

Identifiers (subject_id, experiment_id) & removing respondents with multiple surveys

The original wide-format data had 563 rows, i.e. completed surveys. One worker took the survey twice; we will exclude this worker.

Code
d |> 
  group_by(WorkerID) |> 
  summarise(n_surveys_taken = n(), 
            n_different_start_dates = n_distinct(StartDate)) |> 
  filter(n_surveys_taken > 1)
# A tibble: 1 × 3
  WorkerID       n_surveys_taken n_different_start_dates
  <chr>                    <int>                   <int>
1 A15LHHN76OW2UM               2                       1
Code
d_long_remove_doubles <- d_long |>
  filter(WorkerID != "A15LHHN76OW2UM")

# check 
n_distinct(d_long_remove_doubles$WorkerID)
[1] 561
Code
# check 
n_distinct(d_long$WorkerID)
[1] 562
Code
d1 <- d_long_remove_doubles |> 
  mutate(subject_id = WorkerID, 
         experiment_id = 1) 

Study 2

Code
d <- read_csv("dias_2020-study_2.csv") 
Rows: 1890 Columns: 445
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (16): ResponseID, ResponseSet, Name, IPAddress, StartDate, EndDate, Wor...
dbl (427): Condition, Status, StartDateNum, Finished, confirmCode, IDInst, I...
lgl   (2): ExternalDataReference, Email

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(d)
# A tibble: 6 × 445
  Condition ResponseID   ResponseSet Name  ExternalDataReference Email IPAddress
      <dbl> <chr>        <chr>       <chr> <lgl>                 <lgl> <chr>    
1         2 R_1NfzxXhNc… Default Re… Anon… NA                    NA    75.91.73…
2         2 R_ektOPVZf7… Default Re… Anon… NA                    NA    64.22.25…
3         2 R_uyrfrfz7c… Default Re… Anon… NA                    NA    76.5.191…
4         2 R_cBcEMRtkk… Default Re… Anon… NA                    NA    172.76.4…
5         1 R_2Vs7jWOlP… Default Re… Anon… NA                    NA    67.251.1…
6         2 R_2coorehCL… Default Re… Anon… NA                    NA    73.255.2…
# ℹ 438 more variables: Status <dbl>, StartDate <chr>, StartDateNum <dbl>,
#   EndDate <chr>, Finished <dbl>, confirmCode <dbl>, WorkerID <chr>,
#   IDInst <dbl>, Inst <dbl>, Fake1_S <dbl>, Fake1_2 <dbl>, Fake1_3 <dbl>,
#   Fake1_RT_1 <dbl>, Fake1_RT_2 <dbl>, Fake1_RT_3 <dbl>, Fake1_RT_4 <dbl>,
#   Fake2_S <dbl>, Fake2_2 <dbl>, Fake2_3 <dbl>, Fake2_RT_1 <dbl>,
#   Fake2_RT_2 <dbl>, Fake2_RT_3 <dbl>, Fake2_RT_4 <dbl>, Fake3_S <dbl>,
#   Fake3_2 <dbl>, Fake3_3 <dbl>, Fake3_RT_1 <dbl>, Fake3_RT_2 <dbl>, …

accuracy_raw, veracity

Code
d_long <- d |>
  pivot_longer(
    cols = matches("^(Real|Fake)\\d+_(2(\\.\\d)?|3(\\.\\d)?)$"),  # match _2, _2.0, _2.1, _3, _3.0, _3.1
    names_to = c("veracity", "item", "measure"),
    names_pattern = "^(Real|Fake)(\\d+)_([23](?:\\.\\d)?)$", 
    values_to = "value"
  ) |>
  mutate(
    measure = case_when(
      str_starts(measure, "2") ~ "accuracy",
      str_starts(measure, "3") ~ "sharing"
    )
  ) |>
  # remove NAs, which inevitably exist since each condition has its own set of outcome columns
  drop_na(value) |> 
  pivot_wider(names_from = measure, values_from = value) |> 
  rename(accuracy_raw = accuracy)

# check
d_long |>
  group_by(Condition) |>
  summarise(n_participants = n_distinct(WorkerID),
            n_valid_response = sum(!is.na(accuracy_raw)))
# A tibble: 2 × 3
  Condition n_participants n_valid_response
      <dbl>          <int>            <int>
1         1            918            22325
2         2            933            22551

We code the veracity variable.

Code
d_long <- d_long |> 
  mutate(
    veracity = if_else(veracity == "Fake", "false", "true")
    ) 

scale

Code
table(d_long$accuracy_raw, useNA = "always")

    1     2     3     4  <NA> 
16091 11551 12986  4248    55 
Code
d_long <- d_long |>
  mutate(scale = 4)

Conditions (intervention_label, control_label, condition)

In Study 2, the conditions are facebook (coded as 1) and highlight_banner (coded as 2).

Code
# check
d_long |> 
  group_by(Condition) |> 
  summarise(mean(accuracy_raw, na.rm=TRUE))
# A tibble: 2 × 2
  Condition `mean(accuracy_raw, na.rm = TRUE)`
      <dbl>                              <dbl>
1         1                               2.12
2         2                               2.12

We code the condition variable.

Code
d_long <- d_long |>
  mutate(
    intervention_label = case_when(
      Condition == 2 ~ "highlight_banner",
      TRUE ~ NA_character_
    ),
    control_label = case_when(
      Condition == 1 ~ "facebook",
      TRUE ~ NA_character_
    ),
    condition = if_else(Condition == 2, "treatment", "control")
  )

news_id

We have previously coded item, but this is not yet a unique item identifier: these numbers only identify items within each veracity category.

Code
d_long |> 
  group_by(veracity) |> 
  summarise(n_distinct(item))
# A tibble: 2 × 2
  veracity `n_distinct(item)`
  <chr>                 <int>
1 false                    12
2 true                     12

For our news identifier, we therefore combine the veracity variable with these identifiers.

Code
d_long <- d_long |> 
  mutate(news_id = paste0(veracity, "_", item))

age

Code
d_long <- d_long |> 
  mutate(age = Age
         )

year

Code
d_long <- d_long |> 
  mutate(year = year(mdy(StartDate))
         )

Identifiers (subject_id, experiment_id) & removing respondents with multiple surveys

The original wide-format data had 1890 rows, i.e. completed surveys. First, we get an overview of candidate variables for participant identifiers.

Code
d_long |> 
  summarize(n_distinct(ResponseID),
            n_distinct(WorkerID))
# A tibble: 1 × 2
  `n_distinct(ResponseID)` `n_distinct(WorkerID)`
                     <int>                  <int>
1                     1890                   1845

It seems that, in some cases, the same workers took the survey multiple times.

Code
d_long |> 
  group_by(WorkerID) |> 
  summarise(n_surveys_taken = n_distinct(ResponseID), 
            n_different_start_dates = n_distinct(StartDate)) |> 
  filter(n_surveys_taken > 1)
# A tibble: 12 × 3
   WorkerID       n_surveys_taken n_different_start_dates
   <chr>                    <int>                   <int>
 1 A1B2GXPMA7YONT              35                       1
 2 A1BL5TRC3DHOHD               2                       1
 3 A1BWO4ZG5OB68S               2                       1
 4 A26DD205RQG4UA               2                       1
 5 A2I88USJQLNT2K               2                       1
 6 A2XVOYY8BDEZXF               2                       2
 7 A3ESURUKHP67K6               2                       1
 8 A3TRL4MZMGU22S               2                       1
 9 A3ZRS4RUCH2OO                2                       2
10 A9T9UBZBWFZTL                2                       1
11 AC777GHVJ8U45                2                       1
12 AUU9JZS6MIQ7C                2                       1

In particular, one worker took the survey 35 times. Unfortunately, we do not know the exact time of each survey, only the day it was taken (otherwise we could have kept each worker's first occurrence). Since we cannot tell what happened there, we exclude these participants.

Code
d_long_remove_doubles <- d_long |>
  group_by(WorkerID) |>
  filter(n_distinct(ResponseID) == 1) |>
  ungroup()

# check 
n_distinct(d_long_remove_doubles$WorkerID)
[1] 1833
Code
# check 
n_distinct(d_long$WorkerID)
[1] 1845
Code
d2 <- d_long_remove_doubles |> 
  mutate(subject_id = WorkerID, 
         experiment_id = 2) 

Combine Studies

Combine and add identifiers (country, paper_id)

We combine both studies.

Code
## Combine + add remaining variables
dias_2020 <- bind_rows(d1, d2) |> 
  mutate(country = "United States",
         paper_id = "dias_2020") |> 
  # add_intervention_info 
  bind_cols(intervention_info) |> 
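  # target_variables (defined elsewhere in the project setup) lists the harmonized columns to keep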
  select(any_of(target_variables))

Additional news identifiers (recycled_news, recycled_news_reference)

Since both studies used the same news headlines (with the same labels), we can simply keep these labels. We also add the reference from which the headlines were taken.

Code
## Combine + add remaining variables
dias_2020 <- dias_2020 |> 
  mutate(recycled_news = TRUE, 
         recycled_news_reference = "Pennycook, G., Bear, A., Collins, E. T., & Rand, D. G. (2020). The Implied Truth Effect: Attaching Warnings to a Subset of Fake News Headlines Increases Perceived Accuracy of Headlines Without Warnings. Management Science, 66(11), 4944–4957. https://doi.org/10.1287/mnsc.2019.3478") 

news_selection

Code
## Combine + add remaining variables
dias_2020 <- dias_2020 |> 
  mutate(news_selection = "researchers") 

Write out data

Code
save_data(dias_2020)