Notes on Bulk Recoding in R

In the spirit of notes to myself, here’s a neat trick I learned to bulk recode lots of variables at once.

Suppose we have conducted a survey experiment and gotten lots of data from our participants. Our raw data looks like the following:

raw_data <-tibble(
  ID = LETTERS[1:15],
  var1 = sample(1:15, replace = F),
  var2 = sample(16:30, replace = F),
  var3 = sample(31:45, replace = F)
)
kable(raw_data)
ID var1 var2 var3
A 1 17 40
B 5 18 38
C 15 24 41
D 9 28 36
E 10 26 45
F 4 19 44
G 2 20 34
H 12 22 39
I 13 27 42
J 11 23 32
K 7 25 35
L 14 30 43
M 8 29 31
N 3 21 37
O 6 16 33

Because survey data is likely to contain errors, we have a subject matter expert look at the data. They come back with several changes that we need to make to the data. Helpfully, they put them into a data frame like the following.

fixes <- tibble(
  ID = c("A", "D", "F", "G"),
  original_var = c("var1", "var2", "var2", "var3"),
  original_response = as.integer(c(1, 28, 19, 34)),
  correct_response = as.integer(c(1000, 2800, 1900, 3400))
)
kable(fixes)
ID original_var original_response correct_response
A var1 1 1000
D var2 28 2800
F var2 19 1900
G var3 34 3400

Here we have a problem of updating several variables based on a set of (potentially different) conditions. Here, each variable needs to be updated based on an ID, but we do not need to update every variable/ID combination.

The brute force way of solving this problem (in the tidyverse) might look something like the following:

fix_data_b <- raw_data %>% 
  mutate(var1 = case_when(
              ID == "A"~as.integer(1000),
              TRUE~var1),
          var2 = case_when(
              ID == "D"~as.integer(2800),
              ID == "F"~as.integer(1900),
              TRUE~var2
        ),
          var3 = case_when(
              ID == "G"~as.integer(3400),
              TRUE~var3
        )
    )
kable(fix_data_b)
ID var1 var2 var3
A 1000 17 40
B 5 18 38
C 15 24 41
D 9 2800 36
E 10 26 45
F 4 1900 44
G 2 20 3400
H 12 22 39
I 13 27 42
J 11 23 32
K 7 25 35
L 14 30 43
M 8 29 31
N 3 21 37
O 6 16 33

A brute force solution is not so bad with only four fixes. What if we added another 25 fixes? Very quickly, this solution breaks down. Besides, there is a lot of unneeded typing, which increases the likelihood of making a mistake. Finally, there’s no way to show off that we have a neat trick to make people think we are brilliant, when in fact, we are very lazy.

The trick I learned to deal with this problem is to realize that this problem is the equivalent of updating a table based on a lookup table. As long as our data frames are in the right shape, we can make these updates all at once instead of having to type out each individually.

Step 1: Convert our raw data from wide to long.

raw_data_l <- raw_data %>% 
  pivot_longer(cols = -ID,
               names_to = "var_name",
               values_to = "original_value")
kable(head(raw_data_l))
ID var_name original_value
A var1 1
A var2 17
A var3 40
B var1 5
B var2 18
B var3 38

Step 2: Join our lookup table to our converted raw data.

raw_data_l %>% 
  left_join(fixes, by = c("ID", "var_name" = "original_var"))%>%
  head()%>%
  kable()
ID var_name original_value original_response correct_response
A var1 1 1 1000
A var2 17 NA NA
A var3 40 NA NA
B var1 5 NA NA
B var2 18 NA NA
B var3 38 NA NA

Note that we now a lot of NAs for each variable that does not require a correction. This is good! We can take advantage of that in Step 3.

Step 3: Make the computer do the work of recoding the variables.

raw_data_l %>% 
  left_join(fixes, by = c("ID", "var_name" = "original_var"))%>%
  mutate(corrected_value = if_else(is.na(original_response),
                                   original_value,
                                   correct_response))%>%
  select(ID, var_name, original_value, corrected_value)%>%
  filter(ID %in% fixes$ID)%>%
  kable()
ID var_name original_value corrected_value
A var1 1 1000
A var2 17 17
A var3 40 40
D var1 9 9
D var2 28 2800
D var3 36 36
F var1 4 4
F var2 19 1900
F var3 44 44
G var1 2 2
G var2 20 20
G var3 34 3400

We can put all three steps together in a pipe workflow.

raw_data_l %>% 
  left_join(fixes, by = c("ID", "var_name" = "original_var"))%>%
  mutate(corrected_value = if_else(is.na(original_response),
                                   original_value,
                                   correct_response))%>%
  select(-original_value, -correct_response, -original_response)%>%
  pivot_wider(names_from = "var_name",
              values_from = "corrected_value")%>%
  kable()
ID var1 var2 var3
A 1000 17 40
B 5 18 38
C 15 24 41
D 9 2800 36
E 10 26 45
F 4 1900 44
G 2 20 3400
H 12 22 39
I 13 27 42
J 11 23 32
K 7 25 35
L 14 30 43
M 8 29 31
N 3 21 37
O 6 16 33

Now we have a solution that will scale to many recodes without typing out much code. All that is required is that we create our lookup table. Since we have to keep track of our recoded changes, either in our code or in a separate document, we might as well leverage that work to make our life easier.

Related