A Digital Media Literacy Intervention Increases Discernment Between Mainstream and False News in the United States and India.

Author

Guess, Andrew M.

Published

2020

Reference

Guess, Andrew M., Michael Lerner, Benjamin Lyons, Jacob M. Montgomery, Brendan Nyhan, Jason Reifler, and Neelanjan Sircar. 2020. “A Digital Media Literacy Intervention Increases Discernment Between Mainstream and False News in the United States and India.” Proceedings of the National Academy of Sciences 117 (27): 15536–45. https://doi.org/10.1073/pnas.1920498117.

Intervention

Code
intervention_info <- tibble(
    intervention_description = 'In both studies, participants were randomly assigned to either a control group or a media literacy intervention group. In the intervention group, participants read general tips on how to detect misinformation (e.g., "Some stories are intentionally false. Think critically about the stories you read, and only share news that you know to be credible.")',
    intervention_selection = "literacy",
    originally_identified_treatment_effect = TRUE
      )

# display
show_conditions(intervention_info)
intervention_description
In both studies, participants were randomly assigned to either a control group or a media literacy intervention group. In the intervention group, participants read general tips on how to detect misinformation (e.g., "Some stories are intentionally false. Think critically about the stories you read, and only share news that you know to be credible.")

Notes

This is a two-wave study. In the US, the same items were used in Waves 1 and 2, so only Wave 1 is relevant. In India, however, two different sets of items were used, making both waves relevant.

“Unlike the US study (where the same headlines were used in both waves 1 and 2 to test for prior exposure effects), we used different sets of headlines in each wave.”

The format differed across studies: the US studies used a Facebook-like format (headline, picture, and source), but no lede. In India, there was no picture:

Respondents were presented with the headline in text format in the online survey, while enumerators read the headlines to respondents in the face-to-face survey.

It is ambiguous whether sources were shown; at least in the face-to-face condition this seems implausible. When in doubt, we code no source.

The authors identify a treatment effect on discernment:

“Strikingly, our results indicate that exposure to variants of the Facebook media literacy intervention reduces people’s belief in false headlines. These effects are not only an artifact of greater skepticism toward all information—although the perceived accuracy of mainstream news headlines slightly decreased, exposure to the intervention widened the gap in perceived accuracy between mainstream and false news headlines overall.”
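
As a toy illustration of what “discernment” means here, the snippet below (with hypothetical ratings) computes the gap between mean perceived accuracy of true and false headlines per condition:

```r
library(dplyr)

# Hypothetical ratings on a 1-4 accuracy scale
toy <- tibble(
  condition = rep(c("control", "treatment"), each = 4),
  veracity  = rep(c("true", "false"), times = 4),
  accuracy  = c(3.0, 2.5, 3.0, 2.4, 3.1, 1.9, 3.0, 2.0)
)

# Discernment = mean(true) - mean(false); a wider gap under treatment
# corresponds to the effect described in the quote above
toy |>
  group_by(condition) |>
  summarize(discernment = mean(accuracy[veracity == "true"]) -
              mean(accuracy[veracity == "false"]))
```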

Data Cleaning

Study 1 (United States)

Code
d <- read_dta("guess_2020-US.dta")
head(d)
# A tibble: 6 × 511
     caseid weight headline_1_article_name          headline_2_article_name     
      <dbl>  <dbl> <dbl+lbl>                        <dbl+lbl>                   
1 389941943  0.768 12 [and_now_its_the_tallest_png]  1 [donald_trump_caught_png]
2 395525876  0.988  2 [franklin_graham_png]         13 [google_employees_png]   
3 397441725  0.965 13 [google_employees_png]         1 [donald_trump_caught_png]
4 403011277  0.803 13 [google_employees_png]         2 [franklin_graham_png]    
5 419842674  1.44  16 [economy_adds_more_png]        6 [kavanaugh_accuser_png]  
6 422239823  0.578  8 [lisa_page_png]               12 [and_now_its_the_tallest…
# ℹ 507 more variables: headline_3_article_name <dbl+lbl>,
#   headline_4_article_name <dbl+lbl>, headline_5_article_name <dbl+lbl>,
#   headline_6_article_name <dbl+lbl>, headline_7_article_name <dbl+lbl>,
#   headline_8_article_name <dbl+lbl>, instructions_treat <dbl+lbl>,
#   consent <dbl+lbl>, inputstate <dbl+lbl>, ideo <dbl+lbl>, pid3 <dbl+lbl>,
#   pid3_t <chr>, pid7 <dbl+lbl>, pol_interest <dbl+lbl>,
#   trump_approve <dbl+lbl>, pol_therm_dem <dbl+lbl>, …

veracity, accuracy_raw, news_id, news_slant

There are many candidate variables. Since we do not yet know which veracity each variable corresponds to, the best approach is to use the named ones.

Code
# accuracy ratings by headline
headlines <- c(
  "accuracy_donald_trump_caught",     # Pro-D hyperpartisan
  "accuracy_franklin_graham",         # Pro-D hyperpartisan
  "accuracy_vp_mike_pence",           # Pro-D false
  "accuracy_vice_president_pence",    # Pro-D false
  "accuracy_soros_money_behind",      # Pro-R hyperpartisan
  "accuracy_kavanaugh_accuser",       # Pro-R hyperpartisan
  "accuracy_fbi_agent_who",           # Pro-R false
  "accuracy_lisa_page",               # Pro-R false
  "accuracy_a_series1",               # Pro-D mainstream (low)
  "accuracy_a_border_patrol",         # Pro-D mainstream (low)
  "accuracy_detention_of_migrant",    # Pro-D mainstream (high)
  "accuracy_and_now1",                # Pro-D mainstream (high)
  "accuracy_google_employees",        # Pro-R mainstream (low)
  "accuracy_feds_said_alleged",       # Pro-R mainstream (low)
  "accuracy_small_busisness_opt",     # Pro-R mainstream (high)
  "accuracy_economy_adds_more"        # Pro-R mainstream (high)
)

# check
table(d$accuracy_donald_trump_caught, useNA = "always")

   1    2    3    4 <NA> 
1221  640  370  196 2480 

The fact that the same variables also exist with the suffix “w2” suggests that these contain the Wave 1 ratings we are looking for.
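
As a quick sanity check, we can list the Wave 2 counterparts (a sketch; the exact naming pattern with a “w2” suffix is an assumption):

```r
# Variables that look like Wave 2 versions of the accuracy items
# (the suffix pattern is an assumption)
grep("^accuracy_.*w2", names(d), value = TRUE)
```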

The next step is to qualitatively match the rather cryptic variable names to the headlines given in the appendix:

  • Pro-Democrat Hyperpartisan News

    Pro-D hyper 1: Donald Trump caught privately wishing he’d sided more thoroughly with white supremacists. (accuracy_donald_trump_caught)

    Pro-D hyper 2: Franklin Graham: Attempted rape not a crime. Kavanaugh ‘respected’ his victim by not finishing. (accuracy_franklin_graham)

  • Pro-Democrat False News

    Pro-D false 1: VP Mike Pence Busted Stealing Campaign Funds To Pay His Mortgage Like A Thief. (accuracy_vp_mike_pence)

    Pro-D false 2: Vice President Pence now being investigated for campaign fraud, his ties to Russia and Manafort. (accuracy_vice_president_pence)

  • Pro-Republican Hyperpartisan News

    Pro-R hyper 1: Soros Money Behind ‘Black Political Power’ Outfit Supporting Andrew Gillum in Florida. (accuracy_soros_money_behind)

    Pro-R hyper 2: Kavanaugh Accuser Christine Blasey Exposed For Ties To Big Pharma Abortion Pill Maker. Effort To Derail Kavanaugh Is Plot To Protect Abortion Industry Profits. (accuracy_kavanaugh_accuser)

  • Pro-Republican False News

    Pro-R false 1: Special Agent David Raynor was due to testify against Hillary Clinton when he died. (accuracy_fbi_agent_who)

    Pro-R false 2: Lisa Page Squeals: DNC Server Was Not Hacked By Russia. (accuracy_lisa_page)

  • Mainstream News Congenial to Democrats (Low-Prominence Source)

    Pro-D Mainstream 1: A Series Of Suspicious Money Transfers Followed The Trump Tower Meeting. (accuracy_a_series1)

    Pro-D Mainstream 2: A Border Patrol Agent Has Been Called a ‘Serial Killer’ by Police After Murdering 4 Women. (accuracy_a_border_patrol)

  • Mainstream News Congenial to Democrats (High-Prominence Source)

    Pro-D Mainstream 3: Detention of Migrant Children Has Skyrocketed to Highest Levels Ever. (accuracy_detention_of_migrant)

    Pro-D Mainstream 4: ‘And now it’s the tallest’: Trump, in otherwise sombre 9/11 interview, couldn’t help touting one of his buildings. (accuracy_and_now1)

  • Mainstream News Congenial to Republicans (Low-Prominence Source)

    Pro-R Mainstream 1: Google Workers Discussed Tweaking Search Function to Counter Travel Ban. (accuracy_google_employees)

    Pro-R Mainstream 2: Feds said alleged Russian spy Maria Butina used sex for influence. Now, they’re walking that back. (accuracy_feds_said_alleged)

  • Mainstream News Congenial to Republicans (High-Prominence Source)

    Pro-R Mainstream 3: Small business optimism surges to highest level ever, topping previous record under Reagan. (accuracy_small_busisness_opt)

    Pro-R Mainstream 4: Economy adds more jobs than expected in August, and wage growth hits postrecession high. (accuracy_economy_adds_more)

We first code a lookup table for headline name, veracity, and political slant.

Code
# Create a lookup table
headline_info <- tibble(
  news_id = headlines,
  news_slant = c(
    rep("democrat", 4),  # first 4 are Pro-D
    rep("republican", 4),# next 4 are Pro-R
    rep("democrat", 4),  # next 4 are Pro-D
    rep("republican", 4) # last 4 are Pro-R
  ),
  veracity = c(
    "hyperpartisan", "hyperpartisan", "false", "false",     # Pro-D
    "hyperpartisan", "hyperpartisan", "false", "false",     # Pro-R
    "true", "true", "true", "true",                         # Pro-D mainstream
    "true", "true", "true", "true"                          # Pro-R mainstream
  )
)

We then reshape the data and add veracity and news slant based on the lookup table.

Code
# Now pivot and join
d_long <- d |> 
  pivot_longer(
    cols = all_of(headlines),
    names_to = "news_id",
    values_to = "accuracy_raw"
  ) |> 
  left_join(headline_info, by = "news_id") |> 
  # remove the "accuracy_" prefix
  mutate(news_id = sub("^accuracy_", "", news_id))

# check
# d_long |> 
#   group_by(news_id, veracity, news_slant) |> 
#   summarize(mean(accuracy_raw, na.rm=TRUE))

# plausibility check
# d_long |> 
#   group_by(veracity) |> 
#   summarize(mean(accuracy_raw, na.rm=TRUE))

We remove the hyperpartisan items.

Code
d_long <- d_long |> 
  filter(veracity != "hyperpartisan")

Conditions (intervention_label, condition)

From an e-mail exchange with the first author, we know that the treatment variable is tips, where ‘0’ corresponds to control and ‘1’ corresponds to the literacy intervention.

Code
table(d_long$tips)

    0     1 
29436 29448 
Code
d_long <- d_long |> 
  mutate(condition = ifelse(tips == 0, "control", "treatment"), 
         intervention_label = "literacy"
         )

scale

Code
d_long <- d_long |> 
  mutate(scale = 4)

news_selection

Code
d_long <- d_long |> 
  mutate(news_selection = "researchers")

age

There is already an age variable.

Code
table(d_long$age)

  18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
 552  456  708  828  720  660  792  912  996 1068 1212 1140  816  924 1164 1128 
  34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
 936 1128 1068 1020 1152 1152 1140  816  864  828 1008  660  564  768  984 1020 
  50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
 840  984  888  936 1104 1224 1356 1104 1116 1248 1392 1368 1188 1356 1152 1200 
  66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81 
1080 1056 1128  792 1092  816  804  600  540  576  588  420  456  300  216  192 
  82   83   84   85   87   88   89   90   93 
 156   96   96   84   48   48   36   12   12 

year

Code
d_long <- d_long |> 
  mutate(year = year(ymd_hms(starttime)))

# check
# d_long |> 
#   select(starttime, year)

# check
table(d_long$year)

 2018 
58884 

Concordance (concordance, partisan_identity)

Check the value labels

Code
val_labels(d$pid3)
   Democrat  Republican Independent       Other    Not sure     skipped 
          1           2           3           4           5           8 
  not asked 
          9 
Code
d_long <- d_long |> 
  mutate(partisan_identity = tolower(as_factor(pid3)),
         # make everything that is not democrat or republican NA
         partisan_identity = ifelse(partisan_identity %in% c("democrat", "republican"), 
                                    partisan_identity, 
                                    NA),
         # Make concordance variable
         concordance = ifelse(partisan_identity == news_slant, "concordant", "discordant")
  )

# check
d_long |> 
  select(partisan_identity, pid3, news_slant, concordance)
# A tibble: 58,884 × 4
   partisan_identity pid3            news_slant concordance
   <chr>             <dbl+lbl>       <chr>      <chr>      
 1 <NA>              3 [Independent] democrat   <NA>       
 2 <NA>              3 [Independent] democrat   <NA>       
 3 <NA>              3 [Independent] republican <NA>       
 4 <NA>              3 [Independent] republican <NA>       
 5 <NA>              3 [Independent] democrat   <NA>       
 6 <NA>              3 [Independent] democrat   <NA>       
 7 <NA>              3 [Independent] democrat   <NA>       
 8 <NA>              3 [Independent] democrat   <NA>       
 9 <NA>              3 [Independent] republican <NA>       
10 <NA>              3 [Independent] republican <NA>       
# ℹ 58,874 more rows

Identifiers (subject_id, experiment_id, country) and control_format

Check candidate variable for subject identifier.

Code
n_distinct(d_long$caseid)
[1] 4907

This corresponds to the number reported in the paper.
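
A stricter version of this check could assert the match directly (4,907 being the N reported in the paper):

```r
# Fail loudly if the subject count deviates from the reported N
stopifnot(n_distinct(d_long$caseid) == 4907)
```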

Code
d1 <- d_long |> 
  mutate(subject_id = caseid, 
         experiment_id = 1, 
         country = "United States",
         control_format = "picture, source")

Study 2 (India, face-to-face)

Code
d <- read_dta("guess_2020-India_facetoface.dta")
head(d)
# A tibble: 6 × 89
  survey_date  caseid `_11_gaccuracy`  male low_caste college whatsapp
  <chr>         <dbl>           <dbl> <dbl>     <dbl>   <dbl>    <dbl>
1 Apr 24, 2019   1001            3        1         1       0        0
2 Apr 24, 2019   1002            5        1         1       0        0
3 Apr 24, 2019   1003            3        1         1       0        0
4 Apr 24, 2019   1004            4.10     1         1       0        0
5 Apr 24, 2019   1005            3        0         1       0        0
6 Apr 24, 2019   1006            5        1         1       0        0
# ℹ 82 more variables: days_whatsapp <dbl>, hindu <dbl>, muslim <dbl>,
#   tips <dbl>, placebo_assigned <dbl>, factcheck_assigned <dbl>,
#   placebo_factcheck <dbl>, BSP_feelings <dbl>, BJP_feelings <dbl>,
#   INC_feelings <dbl>, SP_feelings <dbl>, bjp_support <dbl>, bjp_oppose <dbl>,
#   accuracy_modi_stone <dbl>, accuracy_gandhi_pune <dbl>,
#   accuracy_india_jobs <dbl>, accuracy_congress_riots <dbl>,
#   accuracy_modi_kumbh <dbl>, accuracy_congress_pakistan <dbl>, …

veracity, accuracy_raw, news_id, news_slant

For the studies in India, we know that:

Finally, 4 additional false headlines were included in the second wave based on fact checks conducted between the two waves. In total, respondents rated 12 headlines in wave 1 (6 false and 6 true) and 16 in wave 2 (10 false and 6 true).

We also know that (appendix):

Both Wave 1 and Wave 2 included both mainstream and false headlines that were either congenial to Bharatiya Janata Party (BJP) supporters or congenial to BJP opponents as well as headlines pertaining to nationalism issues (either India-Pakistan or Hindu-Muslim relations).

We have matched the cryptic variable names to the news headlines in an external .csv file. The study documentation is not entirely clear, but “FTF” and “MTurk” likely identify four different items each in the second wave, the former used in the face-to-face survey and the latter in the online survey.


Code
headline_info <- read_delim("india_headlines.csv", delim = ";")
Rows: 32 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): news_id, headline, news_slant
dbl (1): wave
lgl (1): veracity

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(headline_info)
# A tibble: 6 × 5
  news_id                    headline                  news_slant veracity  wave
  <chr>                      <chr>                     <chr>      <lgl>    <dbl>
1 accuracy_modi_stone        Modi lays foundation sto… Pro-BJP    TRUE         1
2 accuracy_gandhi_pune       Rahul Gandhi greeted wit… Pro-BJP    TRUE         1
3 accuracy_india_jobs        Govt tried to suppress d… Anti-BJP   TRUE         1
4 accuracy_congress_riots    Study: More riots if Con… Anti-BJP   TRUE         1
5 accuracy_modi_kumbh        Modi first head of state… Pro-BJP    FALSE        1
6 accuracy_congress_pakistan Congress workers chant “… Pro-BJP    FALSE        1
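
To compare the lookup against the reported item structure (12 wave 1 headlines, 6 true and 6 false, plus additional wave 2 items including the FTF/MTurk variants), we can tally it (a sketch):

```r
# Tally lookup items by wave and veracity
headline_info |> count(wave, veracity)
```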

We then reshape the data and add veracity and news slant based on the lookup table.

Code
# Now pivot and join
d_long <- d |> 
  pivot_longer(
    cols = any_of(headline_info$news_id),
    names_to = "news_id",
    values_to = "accuracy_raw"
  ) |> 
  left_join(headline_info, by = "news_id") |> 
  mutate(
    # remove the "accuracy_" prefix
    news_id = sub("^accuracy", "", news_id), 
    # give correct veracity values
    veracity = ifelse(veracity == TRUE, "true", "false")
    )

# check
# d_long |> 
#   group_by(news_id, veracity, news_slant) |> 
#   summarize(mean(accuracy_raw, na.rm=TRUE))

# plausibility check
# d_long |> 
#   group_by(veracity) |> 
#   summarize(mean(accuracy_raw, na.rm=TRUE))

long_term, time_elapsed

The news ratings from the second wave are more distant in time from the intervention, allowing an evaluation of its long-term effects. We therefore want to separate them out (and not consider them in our main analyses).

As for the elapsed time between intervention and the follow-up evaluation, we know that:

“The India face-to-face survey was conducted by the polling firm Morsel in Barabanki, Bahraich, Domariyaganj, and Shrawasti, four parliamentary constituencies in the state of Uttar Pradesh where Hindi is the dominant language (wave 1, April 13 to May 2, 2019, N = 3,744; wave 2, May 7 to 19, 2019, N = 2,695).”

We calculate the average time between these, using the midpoints of the two field periods.

Code
# wave 2 midpoint (≈ May 13) − wave 1 midpoint (≈ April 22) = 21 days
average_time_elapsed <- 21
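
Rather than hard-coding the value, the gap could also be derived from the field-period midpoints (a sketch; wave_gap is a hypothetical helper):

```r
library(lubridate)

# Days between the midpoints of two field periods
# usage: wave_gap(w1_start, w1_end, w2_start, w2_end) with "YYYY-MM-DD" strings
wave_gap <- function(w1_start, w1_end, w2_start, w2_end) {
  mid <- function(a, b) a + (b - a) / 2
  as.numeric(mid(ymd(w2_start), ymd(w2_end)) - mid(ymd(w1_start), ymd(w1_end)))
}
```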

Check the different news_ids.

Code
d_long |> 
  distinct(news_id)
# A tibble: 28 × 1
   news_id           
   <chr>             
 1 _modi_stone       
 2 _gandhi_pune      
 3 _india_jobs       
 4 _congress_riots   
 5 _modi_kumbh       
 6 _congress_pakistan
 7 _modi_court       
 8 _blair_gandhi     
 9 _iaf_pakistan     
10 _water_pakistan   
# ℹ 18 more rows

Label those containing “w2” as long term effect measures.

Code
d_long <- d_long |> 
  mutate(
    # use news_id labels as identifiers
    long_term = ifelse(str_detect(news_id, "w2"), TRUE, FALSE), 
    time_elapsed = average_time_elapsed
    )

# check
table(d_long$long_term, useNA = "always")

FALSE  TRUE  <NA> 
44928 59904     0 

Conditions (intervention_label, condition)

From an e-mail exchange with the first author, we know that the treatment variable is tips, where ‘0’ corresponds to control and ‘1’ corresponds to the literacy intervention.

Code
table(d_long$tips)

    0     1 
53704 51128 
Code
# plausibility check
# d_long |> 
#   group_by(tips, veracity) |> 
#   summarize(mean(accuracy_raw, na.rm=TRUE))
Code
d_long <- d_long |> 
  mutate(condition = ifelse(tips == 0, "control", "treatment"), 
         intervention_label = "literacy"
         )

scale

Code
d_long <- d_long |> 
  mutate(scale = 4)

news_selection

Code
d_long <- d_long |> 
  mutate(news_selection = "researchers")

age

There is only an agegroup variable, but it is unclear what the groups correspond to. We therefore code no age variable.

Code
table(d$agegroup)

  1   2   3   4 
819 839 517 306 

year

Code
d_long <- d_long |> 
  mutate(year = year(mdy(survey_date)))

# check
# d_long |>
#   select(survey_date, year)

# check
table(d_long$year)

  2008   2014   2015   2016   2018   2019   2024 
    28     28    112     56     28 104524     28 

Since these outliers likely reflect data-entry errors in survey_date, we simply set the year to 2019.

Code
d_long <- d_long |> 
  mutate(year = 2019)

Concordance (concordance, partisan_identity)

The two relevant variables for partisan support appear to be bjp_support and bjp_oppose, where 0 means FALSE and 1 means TRUE.

Code
d_long |> 
  group_by(bjp_support, bjp_oppose) |> 
  summarize(n = n_distinct(caseid))
`summarise()` has grouped output by 'bjp_support'. You can override using the
`.groups` argument.
# A tibble: 3 × 3
# Groups:   bjp_support [2]
  bjp_support bjp_oppose     n
        <dbl>      <dbl> <int>
1           0          0  1022
2           0          1   997
3           1          0  1725

We make a single variable out of these, and match it with the news_slant variable.

Code
table(d_long$news_slant)

   Anti-BJP         FTF Nationalist     Pro-BJP 
      29952       14976       29952       29952 
Code
d_long <- d_long %>% 
  # derive partisan identity and its concordance with the news slant
  mutate(# make a clearer party id variable (goes from the most specific to the most general)
         partisan_identity = case_when(bjp_support == 0 & bjp_oppose == 0 ~ NA_character_,
                                       bjp_support == 0 ~ "non_BJP", 
                                       bjp_support == 1 ~ "BJP"),
         # combine party id and political slant 
         concordance = case_when(news_slant == "Pro-BJP" & partisan_identity == "BJP" ~ "concordant",
                                 news_slant == "Anti-BJP" & partisan_identity == "non_BJP" ~ "concordant", 
                                 news_slant == "Pro-BJP" & partisan_identity == "non_BJP" ~ "discordant",
                                 news_slant == "Anti-BJP" & partisan_identity == "BJP" ~ "discordant", 
                                 TRUE ~ NA_character_)
  )

# check
# d_long |> 
#   select(partisan_identity, news_slant, concordance)

Identifiers (subject_id, experiment_id, country) and control_format

Check candidate variable for subject identifier.

Code
n_distinct(d_long$caseid)
[1] 3744

This corresponds to the number reported in the paper.

Code
d2 <- d_long |> 
  mutate(subject_id = caseid, 
         experiment_id = 2, 
         country = "India")

Study 3 (India, online)

Code
d <- read_dta("guess_2020-India_online.dta")
head(d)
# A tibble: 6 × 96
  StartDate           EndDate             ResponseId mid    tips  male low_caste
  <dttm>              <dttm>              <chr>      <chr> <dbl> <dbl>     <dbl>
1 2019-04-30 04:25:24 2019-04-30 04:31:01 R_273QTwi… A101…     0     1         1
2 2019-04-18 11:34:34 2019-04-18 11:43:15 R_2uIkP3i… A10C…     0     1         1
3 2019-04-22 00:10:06 2019-04-22 01:23:20 R_241K9Cw… A10N…     1     0         0
4 2019-04-21 21:57:06 2019-04-21 22:36:15 R_27I1VPP… A10S…     1     1         1
5 2019-04-22 21:58:32 2019-04-22 22:10:55 R_3QMGscu… A10Z…     0     1         0
6 2019-04-23 00:02:35 2019-04-23 00:33:28 R_3O6I9VC… A112…     1     1         0
# ℹ 89 more variables: college <dbl>, hindu <dbl>, muslim <dbl>,
#   whatsapp <dbl>, days_whatsapp <dbl>, birthyear <dbl>, BSP_feelings <dbl>,
#   BJP_feelings <dbl>, INC_feelings <dbl>, SP_feelings <dbl>,
#   bjp_support <dbl>, bjp_oppose <dbl>, pure_control <dbl>,
#   placebo_assigned <dbl>, factcheck_assigned <dbl>,
#   control_placebo_factcheck <dbl>, accuracy_modi_stone <dbl>,
#   accuracy_gandhi_pune <dbl>, accuracy_india_jobs <dbl>, …

veracity, accuracy_raw, news_id, news_slant

We proceed as in Study 2, since the headlines are the same.

Code
headline_info <- read_delim("india_headlines.csv", delim = ";")
Rows: 32 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): news_id, headline, news_slant
dbl (1): wave
lgl (1): veracity

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
head(headline_info)
# A tibble: 6 × 5
  news_id                    headline                  news_slant veracity  wave
  <chr>                      <chr>                     <chr>      <lgl>    <dbl>
1 accuracy_modi_stone        Modi lays foundation sto… Pro-BJP    TRUE         1
2 accuracy_gandhi_pune       Rahul Gandhi greeted wit… Pro-BJP    TRUE         1
3 accuracy_india_jobs        Govt tried to suppress d… Anti-BJP   TRUE         1
4 accuracy_congress_riots    Study: More riots if Con… Anti-BJP   TRUE         1
5 accuracy_modi_kumbh        Modi first head of state… Pro-BJP    FALSE        1
6 accuracy_congress_pakistan Congress workers chant “… Pro-BJP    FALSE        1

We then reshape the data and add veracity and news slant based on the lookup table.

Code
# Now pivot and join
d_long <- d |> 
  pivot_longer(
    cols = any_of(headline_info$news_id),
    names_to = "news_id",
    values_to = "accuracy_raw"
  ) |> 
  left_join(headline_info, by = "news_id") |> 
  mutate(
    # remove the "accuracy_" prefix
    news_id = sub("^accuracy", "", news_id), 
    # give correct veracity values
    veracity = ifelse(veracity == TRUE, "true", "false")
    )

# check
d_long |>
  group_by(news_id, veracity, news_slant) |>
  summarize(mean(accuracy_raw, na.rm=TRUE))
`summarise()` has grouped output by 'news_id', 'veracity'. You can override
using the `.groups` argument.
# A tibble: 28 × 4
# Groups:   news_id, veracity [28]
   news_id            veracity news_slant  `mean(accuracy_raw, na.rm = TRUE)`
   <chr>              <chr>    <chr>                                    <dbl>
 1 _blair_gandhi      false    Anti-BJP                                  2.09
 2 _congress_pakistan false    Pro-BJP                                   2.21
 3 _congress_riots    true     Anti-BJP                                  2.50
 4 _gandhi_pune       true     Pro-BJP                                   2.67
 5 _iaf_pakistan      true     Nationalist                               2.91
 6 _india_jobs        true     Anti-BJP                                  2.61
 7 _kashmir_hindu     false    Nationalist                               2.56
 8 _modi_court        false    Anti-BJP                                  2.14
 9 _modi_kumbh        false    Pro-BJP                                   2.62
10 _modi_stone        true     Pro-BJP                                   2.81
# ℹ 18 more rows
Code
# plausibility check
# d_long |> 
#   group_by(veracity) |> 
#   summarize(mean(accuracy_raw, na.rm=TRUE))

long_term, time_elapsed

The news ratings from the second wave are more distant in time from the intervention, allowing an evaluation of its long-term effects. We therefore want to separate them out (and not consider them in our main analyses).

As for the elapsed time between intervention and the follow-up evaluation, we know that:

“In the online survey, we collected survey data from a national convenience sample of Hindi-speaking Indians recruited via Mechanical Turk and the Internet Research Bureau’s Online Bureau survey panels (wave 1, April 17 to May 1, 2019, N = 3,273; wave 2, May 13 to 19, 2019, N = 1,369).”

We calculate the average time between these, using the midpoints of the two field periods.

Code
# wave 2 midpoint (≈ May 16) − wave 1 midpoint (≈ April 24) = 22 days
average_time_elapsed <- 22
Code
d_long <- d_long |> 
  mutate(
    # use news_id labels as identifiers
    long_term = ifelse(str_detect(news_id, "w2"), TRUE, FALSE), 
    time_elapsed = average_time_elapsed
    )

Conditions (intervention_label, condition)

From an e-mail exchange with the first author, we know that the treatment variable is tips, where ‘0’ corresponds to control and ‘1’ corresponds to the literacy intervention.

Code
table(d_long$tips)

    0     1 
46312 45332 
Code
d_long <- d_long |> 
  mutate(condition = ifelse(tips == 0, "control", "treatment"), 
         intervention_label = "literacy"
         )

scale

Code
d_long <- d_long |> 
  mutate(scale = 4)

news_selection

Code
d_long <- d_long |> 
  mutate(news_selection = "researchers")

age

There is only an agegroup variable, but it is unclear what the groups correspond to. We therefore code no age variable.

Code
table(d$agegroup)

   1    2    3    4 
1564 1157  203  349 

year

Code
d_long <- d_long |> 
  mutate(year = year(ymd_hms(StartDate)))

# check
# d_long |>
#   select(StartDate, year)

Concordance (concordance, partisan_identity)

The two relevant variables for partisan support appear to be bjp_support and bjp_oppose, where 0 means FALSE and 1 means TRUE.

Code
d_long |> 
  group_by(bjp_support, bjp_oppose) |> 
  summarize(n = n_distinct(caseid))
`summarise()` has grouped output by 'bjp_support'. You can override using the
`.groups` argument.
# A tibble: 3 × 3
# Groups:   bjp_support [2]
  bjp_support bjp_oppose     n
        <dbl>      <dbl> <int>
1           0          0  1003
2           0          1   887
3           1          0  1383

We make a single variable out of these, and match it with the news_slant variable.

Code
table(d_long$news_slant)

   Anti-BJP       MTurk Nationalist     Pro-BJP 
      26184       13092       26184       26184 
Code
d_long <- d_long %>% 
  # derive partisan identity and its concordance with the news slant
  mutate(# make a clearer party id variable (goes from the most specific to the most general)
         partisan_identity = case_when(bjp_support == 0 & bjp_oppose == 0 ~ NA_character_,
                                       bjp_support == 0 ~ "non_BJP", 
                                       bjp_support == 1 ~ "BJP"),
         # combine party id and political slant 
         concordance = case_when(news_slant == "Pro-BJP" & partisan_identity == "BJP" ~ "concordant",
                                 news_slant == "Anti-BJP" & partisan_identity == "non_BJP" ~ "concordant", 
                                 news_slant == "Pro-BJP" & partisan_identity == "non_BJP" ~ "discordant",
                                 news_slant == "Anti-BJP" & partisan_identity == "BJP" ~ "discordant", 
                                 TRUE ~ NA_character_)
  )

# check
# d_long |>
#   select(partisan_identity, news_slant, concordance)

Identifiers (subject_id, experiment_id, country) and control_format

Check candidate variable for subject identifier.

Code
n_distinct(d_long$caseid)
[1] 3273

This corresponds to the number reported in the paper.

Code
d3 <- d_long |> 
  mutate(subject_id = caseid, 
         experiment_id = 3, 
         country = "India")

Combine and add identifiers (paper_id)

We combine all three studies.

Code
## Combine + add remaining variables
guess_2020 <- bind_rows(d1, d2, d3) |> 
  mutate(paper_id = "guess_2020") |> 
  # add_intervention_info 
  bind_cols(intervention_info) |> 
  select(any_of(target_variables))

# check
guess_2020 |>
  group_by(paper_id, experiment_id) |>
  summarize(n_observations = n())
`summarise()` has grouped output by 'paper_id'. You can override using the
`.groups` argument.
# A tibble: 3 × 3
# Groups:   paper_id [1]
  paper_id   experiment_id n_observations
  <chr>              <dbl>          <int>
1 guess_2020             1          58884
2 guess_2020             2         104832
3 guess_2020             3          91644

Since both Indian studies used the same news items (with the same labels), we can simply keep the labels in news_id.
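
A quick way to verify the shared labels (a sketch; the four wave 2 FTF and MTurk variants may differ between modes):

```r
# news_id labels present in both Indian studies
intersect(unique(d2$news_id), unique(d3$news_id))
```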


Write out data

Code
save_data(guess_2020)