---
title: Emphasizing Publishers Does Not Effectively Reduce Susceptibility to Misinformation on Social Media
date: "2020"
author: 
  - Dias, Nicholas
categories: 
  - highlight source
bibliography: ../../../references.bib
nocite: |
  @diasEmphasizingPublishersDoes2020
draft: false
---

```{r}
#| label: setup
#| include: false

library(tidyverse)
library(kableExtra)
library(readxl) # read excel files

# load functions
source("../../../R/custom_functions.R")

# load target variables
source("../../../R/variables.R")
```

## Reference

::: {#refs}
:::

## Intervention

```{r}
intervention_info <- tibble(
  intervention_description = 'Study 1: In the control condition, participants saw Facebook-like news posts, with the source domain shown in gray text. Two treatment conditions: in one, the logo of the publishing outlet was shown in a bright banner (logo banner); in the other, no source was shown (neither gray text nor logo banner). Study 2: Like Study 1, but without the "no source" condition.',
  control_format = "facebook",
  control_selection = "facebook",
  control_selection_description = "For Study 1, we will use the Facebook-like condition (facebook) and NOT the condition without a source (no_source) as a control group, since it matches the control group of Study 2, is more comparable with other studies' control groups, and is closer to real-world settings.",
  originally_identified_treatment_effect = FALSE
)

# display
show_conditions(intervention_info)
```

### Notes

Studies 3, 4 and 5 are not relevant, as participants did not provide accuracy ratings but instead rated the trustworthiness of different sources. Study 6 would in principle be relevant, as it tests an intervention of showcasing the source. However, unlike in Studies 1 and 2, in the baseline control condition of Study 6 the text of the news headline was presented in isolation, i.e. as plain text with no source. The treatment effect there is thus not highlighting a source, but adding a source in the first place. Since many other studies use a facebook format where the source is present, and since this format is more realistic in real-world contexts, we prefer using it as a baseline. We therefore exclude Study 6.

Something odd: In Study 1, there are slightly fewer WorkerIDs than ResponseIDs, suggesting that some individuals might have taken the survey several times (see below). In Study 2, there is no ResponseID variable, but in the original wide format data there is also one more completed survey (i.e. row) than distinct WorkerIDs (see below).

To err on the side of caution, we exclude all WorkerIDs with multiple survey takes.

## Data Cleaning

### Study 1

```{r}
d <- read_csv("dias_2020-study_1.csv") 

head(d)
```

#### `accuracy_raw`, `veracity`

There is no documentation. But from the Stata code the authors provide, we know that:

- Columns whose names end in `_2` (like `Fake1_2` or `Real1_2`) represent accuracy ratings made by participants for each news item; columns ending in `_3` analogously hold sharing ratings. It's a mess: apparently each condition has its own set of outcome columns, which produces suffixes like `_2.1` (the toy example below illustrates the naming scheme).
- `Real` and `Fake` in the column names refer to whether the news item was true (real) or false (fake).
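To make the naming scheme concrete, here is a minimal toy illustration (the worker IDs and ratings are made up; only the column-name pattern mirrors the data):

```{r}
# toy data mimicking the naming scheme: each condition writes into its own
# column, hence suffixes like `_2.1` (hypothetical IDs and values)
toy <- tibble(
  WorkerID  = c("w1", "w2"),
  Fake1_2   = c(1, NA), # accuracy rating, fake item 1, one condition's column
  Fake1_2.1 = c(NA, 2), # accuracy rating, fake item 1, another condition's column
  Real1_3   = c(4, NA)  # sharing rating, real item 1
)

# the same reshape as below: veracity/item/measure are parsed from the names
toy |> 
  pivot_longer(
    cols = matches("^(Real|Fake)\\d+_(2(\\.\\d)?|3(\\.\\d)?)$"),
    names_to = c("veracity", "item", "measure"),
    names_pattern = "^(Real|Fake)(\\d+)_([23](?:\\.\\d)?)$",
    values_to = "value"
  ) |> 
  drop_na(value)
```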
We bring the data into long format and build an accuracy outcome column.

```{r}
d_long <- d |> 
  pivot_longer(
    cols = matches("^(Real|Fake)\\d+_(2(\\.\\d)?|3(\\.\\d)?)$"), # match _2, _2.0, _2.1, _3, _3.0, _3.1
    names_to = c("veracity", "item", "measure"),
    names_pattern = "^(Real|Fake)(\\d+)_([23](?:\\.\\d)?)$", 
    values_to = "value"
  ) |> 
  mutate(
    measure = case_when(
      str_starts(measure, "2") ~ "accuracy",
      str_starts(measure, "3") ~ "sharing"
    )
  ) |> 
  # remove all NAs, which inevitably exist, since each condition has its own
  # set of outcome columns
  drop_na(value) |> 
  pivot_wider(names_from = measure, values_from = value) |> 
  rename(accuracy_raw = accuracy)

# check
d_long |> 
  group_by(Condition) |> 
  summarise(
    n_participants = n_distinct(WorkerID),
    n_valid_response = sum(!is.na(accuracy_raw))
  )
```

We code the veracity variable. 

```{r}
d_long <- d_long |> 
  mutate(
    veracity = if_else(veracity == "Fake", "false", "true")
  ) 
```

#### `scale`

```{r}
table(d_long$accuracy_raw, useNA = "always")
```

```{r}
d_long <- d_long |> 
  mutate(scale = 4)
```

#### Conditions (`intervention_label`, `control_label`, `condition`)

From the Stata code we can conclude that in Study 1, the conditions are `no_source` (coded as 1), `facebook` (coded as 2) and `highlight_banner` (coded as 3). In Study 2, the conditions are `facebook` (coded as 1) and `highlight_banner` (coded as 2).

```{r}
# check
d_long |> 
  group_by(Condition) |> 
  summarise(mean(accuracy_raw, na.rm = TRUE))
```

We code the condition variable. 

```{r}
d_long <- d_long |> 
  mutate(
    intervention_label = case_when(
      Condition == 3 ~ "highlight_banner",
      TRUE ~ NA_character_
    ),
    control_label = case_when(
      Condition == 1 ~ "no_source", 
      Condition == 2 ~ "facebook",
      TRUE ~ NA_character_
    ),
    condition = if_else(Condition == 3, "treatment", "control")
  )
```
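As an added plausibility check (ours, not from the original script), a quick cross-tab confirms that each `Condition` code maps to exactly one label combination:

```{r}
# added check: each Condition code should yield exactly one row here
d_long |> 
  count(Condition, condition, intervention_label, control_label)
```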
#### `news_id`

We have previously coded `item`, but this is not yet a unique item identifier; these numbers only identify items within each veracity category.

```{r}
d_long |> 
  group_by(veracity) |> 
  summarise(n_distinct(item))
```

For our news identifier, we therefore combine the veracity variable with these identifiers.

```{r}
d_long <- d_long |> 
  mutate(news_id = paste0(veracity, "_", item))
```

#### `age`

```{r}
d_long <- d_long |> 
  mutate(age = Age)
```

#### `year`

```{r}
d_long <- d_long |> 
  mutate(year = year(mdy(StartDate)))
```

#### Identifiers (`subject_id`, `experiment_id`) & removing respondents with multiple surveys

The original wide format data had `r nrow(d)` lines, i.e. completed surveys. There is one worker with two surveys, whom we will exclude. 

```{r}
d |> 
  group_by(WorkerID) |> 
  summarise(
    n_surveys_taken = n(), 
    n_different_start_dates = n_distinct(StartDate)
  ) |> 
  filter(n_surveys_taken > 1)
```

```{r}
d_long_remove_doubles <- d_long |> 
  filter(WorkerID != "A15LHHN76OW2UM")

# check 
n_distinct(d_long_remove_doubles$WorkerID)
# check 
n_distinct(d_long$WorkerID)
```
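A more generic version of this filter (a sketch; it assumes that, as in Study 2 below, each survey take has its own `ResponseID`) would avoid hard-coding the duplicated WorkerID:

```{r}
# sketch of an equivalent, generic filter: keep only workers with exactly one
# ResponseID (non-destructive here; we just report the resulting worker count)
d_long |> 
  group_by(WorkerID) |> 
  filter(n_distinct(ResponseID) == 1) |> 
  ungroup() |> 
  summarise(n_workers = n_distinct(WorkerID))
```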
```{r}
d1 <- d_long_remove_doubles |> 
  mutate(
    subject_id = WorkerID, 
    experiment_id = 1
  ) 
```

### Study 2

```{r}
d <- read_csv("dias_2020-study_2.csv") 

head(d)
```

#### `accuracy_raw`, `veracity`

```{r}
d_long <- d |> 
  pivot_longer(
    cols = matches("^(Real|Fake)\\d+_(2(\\.\\d)?|3(\\.\\d)?)$"), # match _2, _2.0, _2.1, _3, _3.0, _3.1
    names_to = c("veracity", "item", "measure"),
    names_pattern = "^(Real|Fake)(\\d+)_([23](?:\\.\\d)?)$", 
    values_to = "value"
  ) |> 
  mutate(
    measure = case_when(
      str_starts(measure, "2") ~ "accuracy",
      str_starts(measure, "3") ~ "sharing"
    )
  ) |> 
  # remove all NAs, which inevitably exist, since each condition has its own
  # set of outcome columns
  drop_na(value) |> 
  pivot_wider(names_from = measure, values_from = value) |> 
  rename(accuracy_raw = accuracy)

# check
d_long |> 
  group_by(Condition) |> 
  summarise(
    n_participants = n_distinct(WorkerID),
    n_valid_response = sum(!is.na(accuracy_raw))
  )
```

We code the veracity variable. 

```{r}
d_long <- d_long |> 
  mutate(
    veracity = if_else(veracity == "Fake", "false", "true")
  ) 
```

#### `scale`

```{r}
table(d_long$accuracy_raw, useNA = "always")
```

```{r}
d_long <- d_long |> 
  mutate(scale = 4)
```

#### Conditions (`intervention_label`, `control_label`, `condition`)

In Study 2, the conditions are `facebook` (coded as 1) and `highlight_banner` (coded as 2).

```{r}
# check
d_long |> 
  group_by(Condition) |> 
  summarise(mean(accuracy_raw, na.rm = TRUE))
```

We code the condition variable. 

```{r}
d_long <- d_long |> 
  mutate(
    intervention_label = case_when(
      Condition == 2 ~ "highlight_banner",
      TRUE ~ NA_character_
    ),
    control_label = case_when(
      Condition == 1 ~ "facebook",
      TRUE ~ NA_character_
    ),
    condition = if_else(Condition == 2, "treatment", "control")
  )
```

#### `news_id`

We have previously coded `item`, but this is not yet a unique item identifier; these numbers only identify items within each veracity category.

```{r}
d_long |> 
  group_by(veracity) |> 
  summarise(n_distinct(item))
```

For our news identifier, we therefore combine the veracity variable with these identifiers.

```{r}
d_long <- d_long |> 
  mutate(news_id = paste0(veracity, "_", item))
```

#### `age`

```{r}
d_long <- d_long |> 
  mutate(age = Age)
```

#### `year`

```{r}
d_long <- d_long |> 
  mutate(year = year(mdy(StartDate)))
```

#### Identifiers (`subject_id`, `experiment_id`) & removing respondents with multiple surveys

The original wide format data had `r nrow(d)` lines, i.e. completed surveys. First, we get an overview of candidate variables for participant identifiers.

```{r}
d_long |> 
  summarize(
    n_distinct(ResponseID),
    n_distinct(WorkerID)
  )
```

It seems that some workers took the survey multiple times. 

```{r}
d_long |> 
  group_by(WorkerID) |> 
  summarise(
    n_surveys_taken = n_distinct(ResponseID), 
    n_different_start_dates = n_distinct(StartDate)
  ) |> 
  filter(n_surveys_taken > 1)
```

In particular, one worker took the survey 35 times. Unfortunately, we don't know the exact time; we only get the day the survey was taken (otherwise we could select the first survey occurrence). Not knowing what has been going on there, we will exclude these participants.

```{r}
d_long_remove_doubles <- d_long |> 
  group_by(WorkerID) |> 
  filter(n_distinct(ResponseID) == 1) |> 
  ungroup()

# check 
n_distinct(d_long_remove_doubles$WorkerID)
# check 
n_distinct(d_long$WorkerID)
```

```{r}
d2 <- d_long_remove_doubles |> 
  mutate(
    subject_id = WorkerID, 
    experiment_id = 2
  ) 
```

### Combine Studies

#### Combine and add identifiers (`country`, `paper_id`)

We combine both studies. 

```{r}
## Combine + add remaining variables
dias_2020 <- bind_rows(d1, d2) |> 
  mutate(
    country = "United States",
    paper_id = "dias_2020"
  ) |> 
  # add intervention info 
  bind_cols(intervention_info) |> 
  select(any_of(target_variables))
```

#### Additional news identifiers (`recycled_news`, `recycled_news_reference`)

Since both studies used the same news headlines (with the same labels), we can simply keep the labels. We add the reference the headlines were taken from. 

```{r}
dias_2020 <- dias_2020 |> 
  mutate(
    recycled_news = TRUE, 
    recycled_news_reference = "Pennycook, G., Bear, A., Collins, E. T., & Rand, D. G. (2020). The Implied Truth Effect: Attaching Warnings to a Subset of Fake News Headlines Increases Perceived Accuracy of Headlines Without Warnings. Management Science, 66(11), 4944–4957. https://doi.org/10.1287/mnsc.2019.3478"
  ) 
```

#### `news_selection`

```{r}
dias_2020 <- dias_2020 |> 
  mutate(news_selection = "researchers") 
```

## Write out data

```{r}
save_data(dias_2020)
```
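For a final glance at the combined data (an added check, not part of the original pipeline; it assumes `experiment_id` and `condition` are among the target variables kept above):

```{r}
# added check: response counts per experiment and condition
dias_2020 |> 
  count(experiment_id, condition)
```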